Minimal logical decoding on standbys
Hi,
Craig previously worked on $subject, see thread [1]. A bunch of the
prerequisite features from that and other related threads have been
integrated into PG. What's missing is actually allowing logical
decoding on a standby. The latest patch from that thread does that [2],
but unfortunately hasn't been updated after slipping v10.
The biggest remaining issue to allow it is that the catalog xmin on the
primary has to be above the catalog xmin horizon of all slots on the
standby. The patch in [2] does so by periodically logging a new record
that announces the current catalog xmin horizon. Additionally it
checks that hot_standby_feedback is enabled when doing logical decoding
from a standby.
I don't like the approach of managing the catalog horizon via those
periodically logged catalog xmin announcements. I think we instead
should build ontop of the records we already have and use to compute
snapshot conflicts. As of HEAD we don't know whether such tables are
catalog tables, but that's just a bool that we need to include in the
records, a basically immeasurable overhead given the size of those
records.
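Concretely, that just means each conflict-generating record carries a flag
that is set on the primary and acted on during replay; a simplified excerpt
of what the attached draft does, using xl_heap_clean as the example:

/* each such WAL record gets the new flag */
typedef struct xl_heap_clean
{
    bool            onCatalogTable; /* heap relevant for logical decoding? */
    TransactionId   latestRemovedXid;
    uint16          nredirected;
    uint16          ndead;
} xl_heap_clean;

/* set on the primary when the record is assembled */
xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);

/* and during replay, after the existing query-conflict handling */
if (onCatalogTable)
    ResolveRecoveryConflictWithSlots(node.dbNode, latestRemovedXid);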
I also don't think we should actually enforce hot_standby_feedback being
enabled - there's use-cases where that's not convenient, and it's not
bullet proof anyway (can be enabled/disabled without using logical
decoding inbetween). I think when there's a conflict we should have the
HINT mention that hs_feedback can be used to prevent such conflicts,
that ought to be enough.
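For instance, the conflict error could look roughly like this (a
hypothetical sketch only - message wording and errcode are made up, nothing
like this is in the draft below yet):

ereport(ERROR,
        (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
         errmsg("canceling logical decoding slot due to conflict with recovery"),
         errdetail("Catalog rows needed by the slot's catalog_xmin have been removed."),
         errhint("Enable hot_standby_feedback on the standby, with a physical replication slot on the primary, to avoid such conflicts.")));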
Attached is a rough draft patch. If we were to go for this approach,
we'd obviously need to improve the actual conflict handling against
slots - right now it just logs a WARNING and retries shortly after.
I think there's currently one hole in this approach. Nbtree (and other
index types, which are pretty unlikely to matter here) have this logic
to handle snapshot conflicts for single-page deletions:
    /*
     * If we have any conflict processing to do, it must happen before we
     * update the page.
     *
     * Btree delete records can conflict with standby queries. You might
     * think that vacuum records would conflict as well, but we've handled
     * that already. XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
     * cleaned by the vacuum of the heap and so we can resolve any conflicts
     * just once when that arrives. After that we know that no conflicts
     * exist from individual btree vacuum records on that index.
     */
    if (InHotStandby)
    {
        TransactionId latestRemovedXid = btree_xlog_delete_get_latestRemovedXid(record);
        RelFileNode rnode;

        XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
        ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
                                            xlrec->onCatalogTable, rnode);
    }
I.e. we get the latest removed xid from the heap, which has the
following logic:
    /*
     * If there's nothing running on the standby we don't need to derive a
     * full latestRemovedXid value, so use a fast path out of here. This
     * returns InvalidTransactionId, and so will conflict with all HS
     * transactions; but since we just worked out that that's zero people,
     * it's OK.
     *
     * XXX There is a race condition here, which is that a new backend might
     * start just after we look. If so, it cannot need to conflict, but this
     * coding will result in throwing a conflict anyway.
     */
    if (CountDBBackends(InvalidOid) == 0)
        return latestRemovedXid;

    /*
     * In what follows, we have to examine the previous state of the index
     * page, as well as the heap page(s) it points to. This is only valid if
     * WAL replay has reached a consistent database state; which means that
     * the preceding check is not just an optimization, but is *necessary*. We
     * won't have let in any user sessions before we reach consistency.
     */
    if (!reachedConsistency)
        elog(PANIC, "btree_xlog_delete_get_latestRemovedXid: cannot operate with inconsistent data");
so we wouldn't get a correct xid if nobody is connected to a
database (and, by implication, when not yet consistent).
I'm wondering if it's time to move the latestRemovedXid computation for
this type of record to the primary - it's likely to be cheaper there and
avoids this kind of complication. Secondarily, it'd have the advantage
of making pluggable storage integration easier - there we have the
problem that we don't know which type of relation we're dealing with
during recovery, so such lookups make pluggability harder (zheap just
adds extra flags to signal that, but that's not extensible).
Another alternative would be to just prevent such index deletions for
catalog tables when wal_level = logical.
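That'd be a small guard in the opportunistic deletion paths, something like
the following (hypothetical sketch, not implemented in the attached patch;
placement in _bt_vacuum_one_page() and the hash equivalent is assumed):

/*
 * Hypothetical: skip single-page index deletions on catalog tables when
 * logical decoding is possible, so such records can never conflict with a
 * slot; the dead tuples are left for VACUUM instead.
 */
if (XLogLogicalInfoActive() &&
    RelationIsAccessibleInLogicalDecoding(heapRel))
    return;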
If we were to go with this approach, there'd be at least the following
tasks:
- adapt tests from [2]
- enforce hot-standby to be enabled on the standby when logical slots
are created, and at startup if a logical slot exists
- fix issue around btree_xlog_delete_get_latestRemovedXid etc mentioned
above.
- Have nicer conflict handling than what I implemented here. Craig's
approach deleted the slots, but I'm not sure I like that. Blocking
seems more appropriate here; after all, it's likely that the
replication topology would be broken afterwards.
- get_rel_logical_catalog() shouldn't be in lsyscache.[ch], and can be
optimized (e.g. check wal_level before opening rel etc).
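A sketch of that optimization (returns the same result as the draft's
version, it just bails out before opening the relation when wal_level
doesn't allow logical decoding):

bool
get_rel_logical_catalog(Oid relid)
{
    Relation    rel;
    bool        res;

    /* below wal_level = logical no relation can be a logical catalog */
    if (!XLogLogicalInfoActive())
        return false;

    /* assume the caller already holds an appropriate lock */
    rel = heap_open(relid, NoLock);
    res = RelationIsAccessibleInLogicalDecoding(rel);
    heap_close(rel, NoLock);

    return res;
}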
Once we have this logic, it can be used to implement something like
failover slots on top, by having a mechanism that occasionally
forwards slots on standbys using pg_replication_slot_advance().
Greetings,
Andres Freund
[1]: /messages/by-id/CAMsr+YEVmBJ=dyLw=+kTihmUnGy5_EW4Mig5T0maieg_Zu=XCg@mail.gmail.com
[2]: https://archives.postgresql.org/message-id/CAMsr%2BYEbS8ZZ%2Bw18j7OPM2MZEeDtGN9wDVF68%3DMzpeW%3DKRZZ9Q%40mail.gmail.com
Attachments:
logical-decoding-on-standby.diff (text/x-diff; charset=us-ascii)
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index ab5aaff1566..cd068243d36 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -1154,7 +1154,8 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
RelFileNode rnode;
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer);
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 3eb722ce266..7f8604fbbe2 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -18,6 +18,7 @@
#include "access/hash.h"
#include "access/hash_xlog.h"
#include "access/heapam.h"
+#include "catalog/catalog.h"
#include "miscadmin.h"
#include "utils/rel.h"
#include "storage/lwlock.h"
@@ -25,7 +26,7 @@
#include "storage/predicate.h"
static void _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
- RelFileNode hnode);
+ Relation heapRel);
/*
* _hash_doinsert() -- Handle insertion of a single index tuple.
@@ -138,7 +139,7 @@ restart_insert:
if (IsBufferCleanupOK(buf))
{
- _hash_vacuum_one_page(rel, metabuf, buf, heapRel->rd_node);
+ _hash_vacuum_one_page(rel, metabuf, buf, heapRel);
if (PageGetFreeSpace(page) >= itemsz)
break; /* OK, now we have enough space */
@@ -337,7 +338,7 @@ _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
static void
_hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
- RelFileNode hnode)
+ Relation heapRel)
{
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable = 0;
@@ -394,7 +395,8 @@ _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
xl_hash_vacuum_one_page xlrec;
XLogRecPtr recptr;
- xlrec.hnode = hnode;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(heapRel);
+ xlrec.hnode = heapRel->rd_node;
xlrec.ntuples = ndeletable;
XLogBeginInsert();
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 96501456422..5de6311c2c8 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7565,12 +7565,13 @@ HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
* see comments for vacuum_log_cleanup_info().
*/
XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
+log_heap_cleanup_info(Relation rel, TransactionId latestRemovedXid)
{
xl_heap_cleanup_info xlrec;
XLogRecPtr recptr;
- xlrec.node = rnode;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
+ xlrec.node = rel->rd_node;
xlrec.latestRemovedXid = latestRemovedXid;
XLogBeginInsert();
@@ -7606,6 +7607,7 @@ log_heap_clean(Relation reln, Buffer buffer,
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
xlrec.ndead = ndead;
@@ -7656,6 +7658,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.cutoff_xid = cutoff_xid;
xlrec.ntuples = ntuples;
@@ -7686,7 +7689,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
* heap_buffer, if necessary.
*/
XLogRecPtr
-log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
+log_heap_visible(Relation rel, Buffer heap_buffer, Buffer vm_buffer,
TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
@@ -7696,6 +7699,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(heap_buffer));
Assert(BufferIsValid(vm_buffer));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec.cutoff_xid = cutoff_xid;
xlrec.flags = vmflags;
XLogBeginInsert();
@@ -8116,7 +8120,8 @@ heap_xlog_cleanup_info(XLogReaderState *record)
xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, xlrec->node);
/*
* Actual operation is a no-op. Record type exists to provide a means for
@@ -8152,7 +8157,8 @@ heap_xlog_clean(XLogReaderState *record)
* latestRemovedXid is invalid, skip conflict processing.
*/
if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
/*
* If we have a full-page image, restore it (using a cleanup lock) and
@@ -8248,7 +8254,9 @@ heap_xlog_visible(XLogReaderState *record)
* rather than killing the transaction outright.
*/
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid,
+ xlrec->onCatalogTable,
+ rnode);
/*
* Read the heap page, if it still exists. If the heap file has dropped or
@@ -8385,7 +8393,8 @@ heap_xlog_freeze_page(XLogReaderState *record)
TransactionIdRetreat(latestRemovedXid);
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 695567b4b0d..acdce7f43ad 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -312,7 +312,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
if (XLogRecPtrIsInvalid(recptr))
{
Assert(!InRecovery);
- recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
+ recptr = log_heap_visible(rel, heapBuf, vmBuf,
cutoff_xid, flags);
/*
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 4082103fe2d..481d3640499 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -31,6 +31,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/lsyscache.h"
#include "utils/snapmgr.h"
static void _bt_cachemetadata(Relation rel, BTMetaPageData *metad);
@@ -704,6 +705,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
@@ -1065,6 +1067,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
XLogRecPtr recptr;
xl_btree_delete xlrec_delete;
+ xlrec_delete.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_delete.hnode = heapRel->rd_node;
xlrec_delete.nitems = nitems;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 67a94cb80a2..e3e21398065 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -698,7 +698,8 @@ btree_xlog_delete(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
/*
@@ -982,6 +983,7 @@ btree_xlog_reuse_page(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable,
xlrec->node);
}
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index a83a4b581ed..c7c9c002a29 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -27,6 +27,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
+#include "utils/lsyscache.h"
/* Entry in pending-list of TIDs we need to revisit */
@@ -502,6 +503,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
OffsetNumber itemnos[MaxIndexTuplesPerPage];
spgxlogVacuumRedirect xlrec;
+ xlrec.onCatalogTable = get_rel_logical_catalog(index->rd_index->indrelid);
xlrec.nToPlaceholder = 0;
xlrec.newestRedirectXid = InvalidTransactionId;
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index 9e2bd3f8119..089fe58283b 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -913,6 +913,7 @@ spgRedoVacuumRedirect(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &node, NULL, NULL);
ResolveRecoveryConflictWithSnapshot(xldata->newestRedirectXid,
+ xldata->onCatalogTable,
node);
}
}
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 8134c52253e..456f3323fee 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -450,7 +450,7 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
* No need to write the record at all unless it contains a valid value
*/
if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
- (void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+ (void) log_heap_cleanup_info(rel, vacrelstats->latestRemovedXid);
}
/*
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 9f99e4f0499..f8e89661715 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -94,6 +94,7 @@ CheckLogicalDecodingRequirements(void)
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires a database connection")));
+#ifdef NOT_ANYMORE
/* ----
* TODO: We got to change that someday soon...
*
@@ -111,6 +112,7 @@ CheckLogicalDecodingRequirements(void)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("logical decoding cannot be used while in recovery")));
+#endif
}
/*
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 1f2e7139a70..1b723039a4f 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1064,6 +1064,77 @@ ReplicationSlotReserveWal(void)
}
}
+void
+ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
+{
+ int i;
+ bool found_conflict = false;
+
+ if (max_replication_slots <= 0)
+ return;
+
+restart:
+ if (found_conflict)
+ {
+ CHECK_FOR_INTERRUPTS();
+ /*
+ * Wait awhile for them to die so that we avoid flooding an
+ * unresponsive backend when system is heavily loaded.
+ */
+ pg_usleep(100000);
+ found_conflict = false;
+ }
+
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationSlot *s;
+ NameData slotname;
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;
+
+ s = &ReplicationSlotCtl->replication_slots[i];
+
+ /* cannot change while ReplicationSlotCtlLock is held */
+ if (!s->in_use)
+ continue;
+
+ /* not our database, skip */
+ if (s->data.database != InvalidOid && s->data.database != dboid)
+ continue;
+
+ SpinLockAcquire(&s->mutex);
+ slotname = s->data.name;
+ slot_xmin = s->data.xmin;
+ slot_catalog_xmin = s->data.catalog_xmin;
+ SpinLockRelease(&s->mutex);
+
+ if (TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(WARNING,
+ (errmsg("slot %s w/ xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_xmin, xid)));
+ }
+
+ if (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(WARNING,
+ (errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_catalog_xmin, xid)));
+ }
+ }
+
+ LWLockRelease(ReplicationSlotControlLock);
+
+ if (found_conflict)
+ goto restart;
+}
+
+
/*
* Flush all replication slots to disk.
*
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index c9bb3e987d0..e14f5f132f4 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -23,6 +23,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/slot.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -291,7 +292,8 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
}
void
-ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode node)
+ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
+ bool onCatalogTable, RelFileNode node)
{
VirtualTransactionId *backends;
@@ -312,6 +314,9 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode
ResolveRecoveryConflictWithVirtualXIDs(backends,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+
+ if (onCatalogTable)
+ ResolveRecoveryConflictWithSlots(node.dbNode, latestRemovedXid);
}
void
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index 7a263cc1fdc..fef7e13fe97 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -19,6 +19,7 @@
#include "access/htup_details.h"
#include "access/nbtree.h"
#include "bootstrap/bootstrap.h"
+#include "catalog/catalog.h"
#include "catalog/namespace.h"
#include "catalog/pg_am.h"
#include "catalog/pg_amop.h"
@@ -1860,6 +1861,20 @@ get_rel_persistence(Oid relid)
return result;
}
+bool
+get_rel_logical_catalog(Oid relid)
+{
+ bool res;
+ Relation rel;
+
+ /* assume previously locked */
+ rel = heap_open(relid, NoLock);
+ res = RelationIsAccessibleInLogicalDecoding(rel);
+ heap_close(rel, NoLock);
+
+ return res;
+}
+
/* ---------- TRANSFORM CACHE ---------- */
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 527138440b3..ac40dd26e8c 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -263,6 +263,7 @@ typedef struct xl_hash_init_bitmap_page
*/
typedef struct xl_hash_vacuum_one_page
{
+ bool onCatalogTable;
RelFileNode hnode;
int ntuples;
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 914897f83db..a702b86f481 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -237,6 +237,7 @@ typedef struct xl_heap_update
*/
typedef struct xl_heap_clean
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 nredirected;
uint16 ndead;
@@ -252,6 +253,7 @@ typedef struct xl_heap_clean
*/
typedef struct xl_heap_cleanup_info
{
+ bool onCatalogTable;
RelFileNode node;
TransactionId latestRemovedXid;
} xl_heap_cleanup_info;
@@ -332,6 +334,7 @@ typedef struct xl_heap_freeze_tuple
*/
typedef struct xl_heap_freeze_page
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint16 ntuples;
} xl_heap_freeze_page;
@@ -346,6 +349,7 @@ typedef struct xl_heap_freeze_page
*/
typedef struct xl_heap_visible
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint8 flags;
} xl_heap_visible;
@@ -395,7 +399,7 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
extern const char *heap2_identify(uint8 info);
extern void heap_xlog_logical_rewrite(XLogReaderState *r);
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
+extern XLogRecPtr log_heap_cleanup_info(Relation rel,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
@@ -414,7 +418,7 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
bool *totally_frozen);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
-extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
+extern XLogRecPtr log_heap_visible(Relation rel, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 819373031cd..0710e3a45c9 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -123,6 +123,7 @@ typedef struct xl_btree_split
*/
typedef struct xl_btree_delete
{
+ bool onCatalogTable;
RelFileNode hnode; /* RelFileNode of the heap the index currently
* points at */
int nitems;
@@ -137,6 +138,7 @@ typedef struct xl_btree_delete
*/
typedef struct xl_btree_reuse_page
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/spgxlog.h b/src/include/access/spgxlog.h
index b72ccb5cc48..93185a08143 100644
--- a/src/include/access/spgxlog.h
+++ b/src/include/access/spgxlog.h
@@ -237,6 +237,7 @@ typedef struct spgxlogVacuumRoot
typedef struct spgxlogVacuumRedirect
{
+ bool onCatalogTable;
uint16 nToPlaceholder; /* number of redirects to make placeholders */
OffsetNumber firstPlaceholder; /* first placeholder tuple to remove */
TransactionId newestRedirectXid; /* newest XID of removed redirects */
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 7964ae254f4..7a1228de934 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -205,4 +205,6 @@ extern void CheckPointReplicationSlots(void);
extern void CheckSlotRequirements(void);
+extern void ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid);
+
#endif /* SLOT_H */
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 1fcd8cf1b59..4b123ea67cf 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -28,7 +28,7 @@ extern void InitRecoveryTransactionEnvironment(void);
extern void ShutdownRecoveryTransactionEnvironment(void);
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
- RelFileNode node);
+ bool catalogTable, RelFileNode node);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index ff1705ad2b8..0d3d49df605 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -129,6 +129,7 @@ extern char get_rel_relkind(Oid relid);
extern bool get_rel_relispartition(Oid relid);
extern Oid get_rel_tablespace(Oid relid);
extern char get_rel_persistence(Oid relid);
+extern bool get_rel_logical_catalog(Oid relid);
extern Oid get_transform_fromsql(Oid typid, Oid langid, List *trftypes);
extern Oid get_transform_tosql(Oid typid, Oid langid, List *trftypes);
extern bool get_typisdefined(Oid typid);
On Wed, Dec 12, 2018 at 3:41 PM Andres Freund <andres@anarazel.de> wrote:
I don't like the approach of managing the catalog horizon via those
periodically logged catalog xmin announcements. I think we instead
should build ontop of the records we already have and use to compute
snapshot conflicts. As of HEAD we don't know whether such tables are
catalog tables, but that's just a bool that we need to include in the
records, a basically immeasurable overhead given the size of those
records.
To me, this paragraph appears to say that you don't like Craig's
approach without quite explaining why you don't like it. Could you be
a bit more explicit about that?
I also don't think we should actually enforce hot_standby_feedback being
enabled - there's use-cases where that's not convenient, and it's not
bullet proof anyway (can be enabled/disabled without using logical
decoding inbetween). I think when there's a conflict we should have the
HINT mention that hs_feedback can be used to prevent such conflicts,
that ought to be enough.
If we can make that work, +1 from me.
I'm wondering if it's time to move the latestRemovedXid computation for
this type of record to the primary - it's likely to be cheaper there and
avoids this kind of complication. Secondarily, it'd have the advantage
of making pluggable storage integration easier - there we have the
problem that we don't know which type of relation we're dealing with
during recovery, so such lookups make pluggability harder (zheap just
adds extra flags to signal that, but that's not extensible).
That doesn't look trivial. It seems like _bt_delitems_delete() would
need to get an array of XIDs, but that gets called from
_bt_vacuum_one_page(), which doesn't have that information available.
It doesn't look like there is a particularly cheap way of getting it,
either. What do you have in mind?
Another alternative would be to just prevent such index deletions for
catalog tables when wal_level = logical.
That doesn't sound like a very nice idea.
If we were to go with this approach, there'd be at least the following
tasks:
- adapt tests from [2]
OK.
- enforce hot-standby to be enabled on the standby when logical slots
are created, and at startup if a logical slot exists
Why do we need this?
- fix issue around btree_xlog_delete_get_latestRemovedXid etc mentioned
above.
OK.
- Have nicer conflict handling than what I implemented here. Craig's
approach deleted the slots, but I'm not sure I like that. Blocking
seems more appropriate here; after all, it's likely that the
replication topology would be broken afterwards.
I guess the viable options are approximately -- (1) drop the slot, (2)
advance the slot, (3) mark the slot as "failed" but leave it in
existence as a tombstone, (4) wait until something changes. I like
(3) better than (1). (4) seems pretty unfortunate unless there's some
other system for having the slot advance automatically. Seems like a
way for replication to hang indefinitely without anybody understanding
why it's happened (or, maybe, noticing).
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2018-12-13 19:32:19 -0500, Robert Haas wrote:
On Wed, Dec 12, 2018 at 3:41 PM Andres Freund <andres@anarazel.de> wrote:
I don't like the approach of managing the catalog horizon via those
periodically logged catalog xmin announcements. I think we instead
should build ontop of the records we already have and use to compute
snapshot conflicts. As of HEAD we don't know whether such tables are
catalog tables, but that's just a bool that we need to include in the
records, a basically immeasurable overhead given the size of those
records.
To me, this paragraph appears to say that you don't like Craig's
approach without quite explaining why you don't like it. Could you be
a bit more explicit about that?
I think the conflict system introduced in Craig's patch is quite
complicated, relies on logging new wal records on a regular basis, and
needs to be more conservative about the xmin horizon, which is obviously
not great for performance.
If you look at Craig's patch, it currently relies on blocking out
concurrent checkpoints:
    /*
     * We must prevent a concurrent checkpoint, otherwise the catalog xmin
     * advance xlog record with the new value might be written before the
     * checkpoint but the checkpoint may still see the old
     * oldestCatalogXmin value.
     */
    if (!LWLockConditionalAcquire(CheckpointLock, LW_SHARED))
        /* Couldn't get checkpointer lock; will retry later */
        return;
which on its own seems unacceptable, given that CheckpointLock can be
held by checkpointer for a very long time. While that's ongoing the
catalog xmin horizon doesn't advance.
Looking at the code it seems hard, to me, to make that approach work
nicely. But I might just be tired.
I'm wondering if it's time to move the latestRemovedXid computation for
this type of record to the primary - it's likely to be cheaper there and
avoids this kind of complication. Secondarily, it'd have the advantage
of making pluggable storage integration easier - there we have the
problem that we don't know which type of relation we're dealing with
during recovery, so such lookups make pluggability harder (zheap just
adds extra flags to signal that, but that's not extensible).
That doesn't look trivial. It seems like _bt_delitems_delete() would
need to get an array of XIDs, but that gets called from
_bt_vacuum_one_page(), which doesn't have that information available.
It doesn't look like there is a particularly cheap way of getting it,
either. What do you have in mind?
I've a prototype attached, but let's discuss the details in a separate
thread. This also needs to be changed for pluggable storage, as we don't
know about table access methods in the startup process, so we can't
determine which AM the heap is from during
btree_xlog_delete_get_latestRemovedXid() (and sibling routines).
Writing that message right now.
- enforce hot-standby to be enabled on the standby when logical slots
are created, and at startup if a logical slot exists
Why do we need this?
Currently the conflict routines are only called when hot standby is
on. There's also no way to use logical decoding (including just advancing
the slot) without hot-standby being enabled, so I think that'd be a pretty
harmless restriction.
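E.g. a check along these lines at slot creation and at startup would be
enough (hypothetical sketch, not part of the attached prototype):

if (RecoveryInProgress() && !EnableHotStandby)
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("logical decoding on a standby requires hot_standby to be enabled")));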
- Have nicer conflict handling than what I implemented here. Craig's
approach deleted the slots, but I'm not sure I like that. Blocking
seems more appropriate here; after all, it's likely that the
replication topology would be broken afterwards.
I guess the viable options are approximately --
(1) drop the slot
Doable.
(2) advance the slot
That's not realistically possible, I think. We'd need to be able to use
most of the logical decoding infrastructure in that context, and we
don't have that available. It's also possible to deadlock, where
advancing the slot's xmin horizon would need further WAL, but WAL replay
is blocked on advancing the slot.
(3) mark the slot as "failed" but leave it in existence as a tombstone
We currently don't have that, but it'd be doable, I think.
(4) wait until something changes.
(4) seems pretty unfortunate unless there's some other system for
having the slot advance automatically. Seems like a way for
replication to hang indefinitely without anybody understanding why
it's happened (or, maybe, noticing).
On the other hand, it would often allow whatever user of the slot to
continue using it, till the conflict is "resolved". To me it seems about
as easy to debug physical replication being blocked, as somehow the slot
being magically deleted or marked as invalid.
Thanks for looking,
Andres Freund
Attachments:
index-page-vacuum-xid-horizon-primary.diff (text/x-diff; charset=us-ascii)
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index ab5aaff1566..2f13a0fd2ad 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -969,155 +969,6 @@ hash_xlog_update_meta_page(XLogReaderState *record)
UnlockReleaseBuffer(metabuf);
}
-/*
- * Get the latestRemovedXid from the heap pages pointed at by the index
- * tuples being deleted. See also btree_xlog_delete_get_latestRemovedXid,
- * on which this function is based.
- */
-static TransactionId
-hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record)
-{
- xl_hash_vacuum_one_page *xlrec;
- OffsetNumber *unused;
- Buffer ibuffer,
- hbuffer;
- Page ipage,
- hpage;
- RelFileNode rnode;
- BlockNumber blkno;
- ItemId iitemid,
- hitemid;
- IndexTuple itup;
- HeapTupleHeader htuphdr;
- BlockNumber hblkno;
- OffsetNumber hoffnum;
- TransactionId latestRemovedXid = InvalidTransactionId;
- int i;
-
- xlrec = (xl_hash_vacuum_one_page *) XLogRecGetData(record);
-
- /*
- * If there's nothing running on the standby we don't need to derive a
- * full latestRemovedXid value, so use a fast path out of here. This
- * returns InvalidTransactionId, and so will conflict with all HS
- * transactions; but since we just worked out that that's zero people,
- * it's OK.
- *
- * XXX There is a race condition here, which is that a new backend might
- * start just after we look. If so, it cannot need to conflict, but this
- * coding will result in throwing a conflict anyway.
- */
- if (CountDBBackends(InvalidOid) == 0)
- return latestRemovedXid;
-
- /*
- * Check if WAL replay has reached a consistent database state. If not, we
- * must PANIC. See the definition of
- * btree_xlog_delete_get_latestRemovedXid for more details.
- */
- if (!reachedConsistency)
- elog(PANIC, "hash_xlog_vacuum_get_latestRemovedXid: cannot operate with inconsistent data");
-
- /*
- * Get index page. If the DB is consistent, this should not fail, nor
- * should any of the heap page fetches below. If one does, we return
- * InvalidTransactionId to cancel all HS transactions. That's probably
- * overkill, but it's safe, and certainly better than panicking here.
- */
- XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
- ibuffer = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, RBM_NORMAL);
-
- if (!BufferIsValid(ibuffer))
- return InvalidTransactionId;
- LockBuffer(ibuffer, HASH_READ);
- ipage = (Page) BufferGetPage(ibuffer);
-
- /*
- * Loop through the deleted index items to obtain the TransactionId from
- * the heap items they point to.
- */
- unused = (OffsetNumber *) ((char *) xlrec + SizeOfHashVacuumOnePage);
-
- for (i = 0; i < xlrec->ntuples; i++)
- {
- /*
- * Identify the index tuple about to be deleted.
- */
- iitemid = PageGetItemId(ipage, unused[i]);
- itup = (IndexTuple) PageGetItem(ipage, iitemid);
-
- /*
- * Locate the heap page that the index tuple points at
- */
- hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
- hbuffer = XLogReadBufferExtended(xlrec->hnode, MAIN_FORKNUM,
- hblkno, RBM_NORMAL);
-
- if (!BufferIsValid(hbuffer))
- {
- UnlockReleaseBuffer(ibuffer);
- return InvalidTransactionId;
- }
- LockBuffer(hbuffer, HASH_READ);
- hpage = (Page) BufferGetPage(hbuffer);
-
- /*
- * Look up the heap tuple header that the index tuple points at by
- * using the heap node supplied with the xlrec. We can't use
- * heap_fetch, since it uses ReadBuffer rather than XLogReadBuffer.
- * Note that we are not looking at tuple data here, just headers.
- */
- hoffnum = ItemPointerGetOffsetNumber(&(itup->t_tid));
- hitemid = PageGetItemId(hpage, hoffnum);
-
- /*
- * Follow any redirections until we find something useful.
- */
- while (ItemIdIsRedirected(hitemid))
- {
- hoffnum = ItemIdGetRedirect(hitemid);
- hitemid = PageGetItemId(hpage, hoffnum);
- CHECK_FOR_INTERRUPTS();
- }
-
- /*
- * If the heap item has storage, then read the header and use that to
- * set latestRemovedXid.
- *
- * Some LP_DEAD items may not be accessible, so we ignore them.
- */
- if (ItemIdHasStorage(hitemid))
- {
- htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid);
- HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid);
- }
- else if (ItemIdIsDead(hitemid))
- {
- /*
- * Conjecture: if hitemid is dead then it had xids before the xids
- * marked on LP_NORMAL items. So we just ignore this item and move
- * onto the next, for the purposes of calculating
- * latestRemovedxids.
- */
- }
- else
- Assert(!ItemIdIsUsed(hitemid));
-
- UnlockReleaseBuffer(hbuffer);
- }
-
- UnlockReleaseBuffer(ibuffer);
-
- /*
- * If all heap tuples were LP_DEAD then we will be returning
- * InvalidTransactionId here, which avoids conflicts. This matches
- * existing logic which assumes that LP_DEAD tuples must already be older
- * than the latestRemovedXid on the cleanup record that set them as
- * LP_DEAD, hence must already have generated a conflict.
- */
- return latestRemovedXid;
-}
-
/*
* replay delete operation in hash index to remove
* tuples marked as DEAD during index tuple insertion.
@@ -1149,12 +1000,10 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
*/
if (InHotStandby)
{
- TransactionId latestRemovedXid =
- hash_xlog_vacuum_get_latestRemovedXid(record);
RelFileNode rnode;
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
}
action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer);
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 3eb722ce266..f9a261a713f 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -24,8 +24,8 @@
#include "storage/buf_internals.h"
#include "storage/predicate.h"
-static void _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
- RelFileNode hnode);
+static void _hash_vacuum_one_page(Relation rel, Relation hrel,
+ Buffer metabuf, Buffer buf);
/*
* _hash_doinsert() -- Handle insertion of a single index tuple.
@@ -138,7 +138,7 @@ restart_insert:
if (IsBufferCleanupOK(buf))
{
- _hash_vacuum_one_page(rel, metabuf, buf, heapRel->rd_node);
+ _hash_vacuum_one_page(rel, heapRel, metabuf, buf);
if (PageGetFreeSpace(page) >= itemsz)
break; /* OK, now we have enough space */
@@ -336,8 +336,7 @@ _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
*/
static void
-_hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
- RelFileNode hnode)
+_hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
{
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable = 0;
@@ -361,6 +360,10 @@ _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
if (ndeletable > 0)
{
+ TransactionId latestRemovedXid;
+
+ latestRemovedXid = index_compute_xid_horizon_for_tuples(rel, hrel, buf, deletable, ndeletable);
+
/*
* Write-lock the meta page so that we can decrement tuple count.
*/
@@ -394,7 +397,8 @@ _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
xl_hash_vacuum_one_page xlrec;
XLogRecPtr recptr;
- xlrec.hnode = hnode;
+ xlrec.latestRemovedXid = latestRemovedXid;
+ xlrec.hnode = hrel->rd_node;
xlrec.ntuples = ndeletable;
XLogBeginInsert();
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 96501456422..049a8498e8f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7558,6 +7558,135 @@ HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
/* *latestRemovedXid may still be invalid at end */
}
+/*
+ * Get the latestRemovedXid from the heap pages pointed at by the index
+ * tuples being deleted.
+ *
+ * This puts the work for calculating latestRemovedXid into the recovery path
+ * rather than the primary path.
+ *
+ * It's possible that this generates a fair amount of I/O, since an index
+ * block may have hundreds of tuples being deleted. Repeat accesses to the
+ * same heap blocks are common, though are not yet optimised.
+ *
+ * XXX optimise later with something like XLogPrefetchBuffer()
+ */
+TransactionId
+heap_compute_xid_horizon_for_tuples(Relation rel,
+ ItemPointerData *tids,
+ int nitems)
+{
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ BlockNumber hblkno;
+ Buffer buf = InvalidBuffer;
+ Page hpage;
+
+ /*
+ * Sort to avoid repeated lookups for the same page, and to make it more
+ * likely to access items in an efficient order. In particular this
+ * ensures that if there are multiple pointers to the same page, they all
+ * get processed looking up and locking the page just once.
+ */
+ qsort((void *) tids, nitems, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) ItemPointerCompare);
+
+ /* prefetch all pages */
+#ifdef USE_PREFETCH
+ hblkno = InvalidBlockNumber;
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemPointer htid = &tids[i];
+
+ if (hblkno == InvalidBlockNumber ||
+ ItemPointerGetBlockNumber(htid) != hblkno)
+ {
+ hblkno = ItemPointerGetBlockNumber(htid);
+
+ PrefetchBuffer(rel, MAIN_FORKNUM, hblkno);
+ }
+ }
+#endif
+
+ /* Iterate over all tids, and check their horizon */
+ hblkno = InvalidBlockNumber;
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemPointer htid = &tids[i];
+ ItemId hitemid;
+ OffsetNumber hoffnum;
+
+ /*
+ * Read heap buffer, but avoid refetching if it's the same block as
+ * required for the last tid.
+ */
+ if (hblkno == InvalidBlockNumber ||
+ ItemPointerGetBlockNumber(htid) != hblkno)
+ {
+ /* release old buffer */
+ if (BufferIsValid(buf))
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buf);
+ }
+
+ hblkno = ItemPointerGetBlockNumber(htid);
+
+ buf = ReadBuffer(rel, hblkno);
+ hpage = BufferGetPage(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ }
+
+ hoffnum = ItemPointerGetOffsetNumber(htid);
+ hitemid = PageGetItemId(hpage, hoffnum);
+
+ /*
+ * Follow any redirections until we find something useful.
+ */
+ while (ItemIdIsRedirected(hitemid))
+ {
+ hoffnum = ItemIdGetRedirect(hitemid);
+ hitemid = PageGetItemId(hpage, hoffnum);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ /*
+ * If the heap item has storage, then read the header and use that to
+ * set latestRemovedXid.
+ *
+ * Some LP_DEAD items may not be accessible, so we ignore them.
+ */
+ if (ItemIdHasStorage(hitemid))
+ {
+ HeapTupleHeader htuphdr;
+
+ htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid);
+
+ HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid);
+ }
+ else if (ItemIdIsDead(hitemid))
+ {
+ /*
+ * Conjecture: if hitemid is dead then it had xids before the xids
+ * marked on LP_NORMAL items. So we just ignore this item and move
+ * onto the next, for the purposes of calculating
+ * latestRemovedxids.
+ */
+ }
+ else
+ Assert(!ItemIdIsUsed(hitemid));
+
+ }
+
+ if (BufferIsValid(buf))
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buf);
+ }
+
+ return latestRemovedXid;
+}
+
/*
* Perform XLogInsert to register a heap cleanup info message. These
* messages are sent once per VACUUM and are required because
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 9d087756879..c4064b7c02e 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -275,6 +275,42 @@ BuildIndexValueDescription(Relation indexRelation,
return buf.data;
}
+/*
+ * Get the latestRemovedXid from the heap pages pointed at by the index
+ * tuples being deleted.
+ */
+TransactionId
+index_compute_xid_horizon_for_tuples(Relation irel,
+ Relation hrel,
+ Buffer ibuf,
+ OffsetNumber *itemnos,
+ int nitems)
+{
+ ItemPointerData *htids = (ItemPointerData *) palloc(sizeof(ItemPointerData) * nitems);
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ Page ipage = BufferGetPage(ibuf);
+ IndexTuple itup;
+
+ /* identify what the index tuples about to be deleted point to */
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemId iitemid;
+
+ iitemid = PageGetItemId(ipage, itemnos[i]);
+ itup = (IndexTuple) PageGetItem(ipage, iitemid);
+
+ ItemPointerCopy(&itup->t_tid, &htids[i]);
+ }
+
+ /* determine the actual xid horizon */
+ latestRemovedXid =
+ heap_compute_xid_horizon_for_tuples(hrel, htids, nitems);
+
+ pfree(htids);
+
+ return latestRemovedXid;
+}
+
/* ----------------------------------------------------------------
* heap-or-index-scan access to system catalogs
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 4082103fe2d..7228c012ad5 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1032,10 +1032,16 @@ _bt_delitems_delete(Relation rel, Buffer buf,
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ TransactionId latestRemovedXid = InvalidTransactionId;
/* Shouldn't be called unless there's something to do */
Assert(nitems > 0);
+ if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+ latestRemovedXid =
+ index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+ itemnos, nitems);
+
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
@@ -1065,6 +1071,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
XLogRecPtr recptr;
xl_btree_delete xlrec_delete;
+ xlrec_delete.latestRemovedXid = latestRemovedXid;
xlrec_delete.hnode = heapRel->rd_node;
xlrec_delete.nitems = nitems;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 67a94cb80a2..052de4b2f3d 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -518,159 +518,6 @@ btree_xlog_vacuum(XLogReaderState *record)
UnlockReleaseBuffer(buffer);
}
-/*
- * Get the latestRemovedXid from the heap pages pointed at by the index
- * tuples being deleted. This puts the work for calculating latestRemovedXid
- * into the recovery path rather than the primary path.
- *
- * It's possible that this generates a fair amount of I/O, since an index
- * block may have hundreds of tuples being deleted. Repeat accesses to the
- * same heap blocks are common, though are not yet optimised.
- *
- * XXX optimise later with something like XLogPrefetchBuffer()
- */
-static TransactionId
-btree_xlog_delete_get_latestRemovedXid(XLogReaderState *record)
-{
- xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
- OffsetNumber *unused;
- Buffer ibuffer,
- hbuffer;
- Page ipage,
- hpage;
- RelFileNode rnode;
- BlockNumber blkno;
- ItemId iitemid,
- hitemid;
- IndexTuple itup;
- HeapTupleHeader htuphdr;
- BlockNumber hblkno;
- OffsetNumber hoffnum;
- TransactionId latestRemovedXid = InvalidTransactionId;
- int i;
-
- /*
- * If there's nothing running on the standby we don't need to derive a
- * full latestRemovedXid value, so use a fast path out of here. This
- * returns InvalidTransactionId, and so will conflict with all HS
- * transactions; but since we just worked out that that's zero people,
- * it's OK.
- *
- * XXX There is a race condition here, which is that a new backend might
- * start just after we look. If so, it cannot need to conflict, but this
- * coding will result in throwing a conflict anyway.
- */
- if (CountDBBackends(InvalidOid) == 0)
- return latestRemovedXid;
-
- /*
- * In what follows, we have to examine the previous state of the index
- * page, as well as the heap page(s) it points to. This is only valid if
- * WAL replay has reached a consistent database state; which means that
- * the preceding check is not just an optimization, but is *necessary*. We
- * won't have let in any user sessions before we reach consistency.
- */
- if (!reachedConsistency)
- elog(PANIC, "btree_xlog_delete_get_latestRemovedXid: cannot operate with inconsistent data");
-
- /*
- * Get index page. If the DB is consistent, this should not fail, nor
- * should any of the heap page fetches below. If one does, we return
- * InvalidTransactionId to cancel all HS transactions. That's probably
- * overkill, but it's safe, and certainly better than panicking here.
- */
- XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
- ibuffer = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, RBM_NORMAL);
- if (!BufferIsValid(ibuffer))
- return InvalidTransactionId;
- LockBuffer(ibuffer, BT_READ);
- ipage = (Page) BufferGetPage(ibuffer);
-
- /*
- * Loop through the deleted index items to obtain the TransactionId from
- * the heap items they point to.
- */
- unused = (OffsetNumber *) ((char *) xlrec + SizeOfBtreeDelete);
-
- for (i = 0; i < xlrec->nitems; i++)
- {
- /*
- * Identify the index tuple about to be deleted
- */
- iitemid = PageGetItemId(ipage, unused[i]);
- itup = (IndexTuple) PageGetItem(ipage, iitemid);
-
- /*
- * Locate the heap page that the index tuple points at
- */
- hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
- hbuffer = XLogReadBufferExtended(xlrec->hnode, MAIN_FORKNUM, hblkno, RBM_NORMAL);
- if (!BufferIsValid(hbuffer))
- {
- UnlockReleaseBuffer(ibuffer);
- return InvalidTransactionId;
- }
- LockBuffer(hbuffer, BT_READ);
- hpage = (Page) BufferGetPage(hbuffer);
-
- /*
- * Look up the heap tuple header that the index tuple points at by
- * using the heap node supplied with the xlrec. We can't use
- * heap_fetch, since it uses ReadBuffer rather than XLogReadBuffer.
- * Note that we are not looking at tuple data here, just headers.
- */
- hoffnum = ItemPointerGetOffsetNumber(&(itup->t_tid));
- hitemid = PageGetItemId(hpage, hoffnum);
-
- /*
- * Follow any redirections until we find something useful.
- */
- while (ItemIdIsRedirected(hitemid))
- {
- hoffnum = ItemIdGetRedirect(hitemid);
- hitemid = PageGetItemId(hpage, hoffnum);
- CHECK_FOR_INTERRUPTS();
- }
-
- /*
- * If the heap item has storage, then read the header and use that to
- * set latestRemovedXid.
- *
- * Some LP_DEAD items may not be accessible, so we ignore them.
- */
- if (ItemIdHasStorage(hitemid))
- {
- htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid);
-
- HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid);
- }
- else if (ItemIdIsDead(hitemid))
- {
- /*
- * Conjecture: if hitemid is dead then it had xids before the xids
- * marked on LP_NORMAL items. So we just ignore this item and move
- * onto the next, for the purposes of calculating
- * latestRemovedxids.
- */
- }
- else
- Assert(!ItemIdIsUsed(hitemid));
-
- UnlockReleaseBuffer(hbuffer);
- }
-
- UnlockReleaseBuffer(ibuffer);
-
- /*
- * If all heap tuples were LP_DEAD then we will be returning
- * InvalidTransactionId here, which avoids conflicts. This matches
- * existing logic which assumes that LP_DEAD tuples must already be older
- * than the latestRemovedXid on the cleanup record that set them as
- * LP_DEAD, hence must already have generated a conflict.
- */
- return latestRemovedXid;
-}
-
static void
btree_xlog_delete(XLogReaderState *record)
{
@@ -693,12 +540,11 @@ btree_xlog_delete(XLogReaderState *record)
*/
if (InHotStandby)
{
- TransactionId latestRemovedXid = btree_xlog_delete_get_latestRemovedXid(record);
RelFileNode rnode;
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
}
/*
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 534fac7bf2f..0318da88bc2 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -186,6 +186,11 @@ extern IndexScanDesc RelationGetIndexScan(Relation indexRelation,
extern void IndexScanEnd(IndexScanDesc scan);
extern char *BuildIndexValueDescription(Relation indexRelation,
Datum *values, bool *isnull);
+extern TransactionId index_compute_xid_horizon_for_tuples(Relation irel,
+ Relation hrel,
+ Buffer ibuf,
+ OffsetNumber *itemnos,
+ int nitems);
/*
* heap-or-index access to system catalogs (in genam.c)
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 527138440b3..d46dc1a85b3 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -263,6 +263,7 @@ typedef struct xl_hash_init_bitmap_page
*/
typedef struct xl_hash_vacuum_one_page
{
+ TransactionId latestRemovedXid;
RelFileNode hnode;
int ntuples;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 64cfdbd2f06..af8612e625b 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -184,6 +184,10 @@ extern void simple_heap_update(Relation relation, ItemPointer otid,
extern void heap_sync(Relation relation);
extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
+extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel,
+ ItemPointerData *items,
+ int nitems);
+
/* in heap/pruneheap.c */
extern void heap_page_prune_opt(Relation relation, Buffer buffer);
extern int heap_page_prune(Relation relation, Buffer buffer,
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 819373031cd..ca2a729169a 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -123,6 +123,7 @@ typedef struct xl_btree_split
*/
typedef struct xl_btree_delete
{
+ TransactionId latestRemovedXid;
RelFileNode hnode; /* RelFileNode of the heap the index currently
* points at */
int nitems;
Hi,
On 12/12/2018 21:41, Andres Freund wrote:
I don't like the approach of managing the catalog horizon via those
periodically logged catalog xmin announcements. I think we instead
should build ontop of the records we already have and use to compute
snapshot conflicts. As of HEAD we don't know whether such tables are
catalog tables, but that's just a bool that we need to include in the
records, a basically immeasurable overhead given the size of those
records.
IIRC I was originally advocating adding that xmin announcement to the
standby snapshot message, but this seems better.
If we were to go with this approach, there'd be at least the following
tasks:
- adapt tests from [2]
- enforce hot-standby to be enabled on the standby when logical slots
are created, and at startup if a logical slot exists
- fix issue around btree_xlog_delete_get_latestRemovedXid etc mentioned
above.
- Have nicer conflict handling than what I implemented here. Craig's
approach deleted the slots, but I'm not sure I like that. Blocking
seems more appropriate here; after all, it's likely that the
replication topology would be broken afterwards.
- get_rel_logical_catalog() shouldn't be in lsyscache.[ch], and can be
optimized (e.g. check wal_level before opening rel etc).
Once we have this logic, it can be used to implement something like
failover slots on top, by having a mechanism that occasionally
forwards slots on standbys using pg_replication_slot_advance().
Looking at this from the failover slots perspective. Wouldn't blocking
on conflict mean that we stop physical replication on catalog xmin
advance when there is lagging logical replication on the primary? It might
not be too big a deal, as in that use-case it should only happen if
hs_feedback was off at some point, but just wanted to point out this
potential problem.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
While testing this feature I found that if lots of inserts happen on
the master cluster then pg_recvlogical does not show the DATA
information for a logical replication slot created on the SLAVE.
Please refer to this scenario -
1)
Create a Master cluster with wal_level=logical and create a logical
replication slot -
SELECT * FROM pg_create_logical_replication_slot('master_slot',
'test_decoding');
2)
Create a Standby cluster using pg_basebackup (./pg_basebackup -D
slave/ -v -R) and create a logical replication slot -
SELECT * FROM pg_create_logical_replication_slot('standby_slot',
'test_decoding');
3)
X terminal - start pg_recvlogical, provide port=5555 (slave
cluster) and specify slot=standby_slot
./pg_recvlogical -d postgres -p 5555 -s 1 -F 1 -v --slot=standby_slot
--start -f -
Y terminal - start pg_recvlogical, provide port=5432 (master
cluster) and specify slot=master_slot
./pg_recvlogical -d postgres -p 5432 -s 1 -F 1 -v --slot=master_slot
--start -f -
Z terminal - run pgbench against the Master cluster (./pgbench -i -s 10
postgres)
Able to see DATA information on the Y terminal but not on X,
but the same is visible by firing this query on the SLAVE cluster -
SELECT * FROM pg_logical_slot_get_changes('standby_slot', NULL, NULL);
Is it expected?
regards,
tushar
--
regards,
tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
Hi,
On 2019-03-01 13:33:23 +0530, tushar wrote:
While testing this feature, I found that if lots of inserts happen on the
master cluster then pg_recvlogical does not show the DATA information on
the logical replication slot which was created on the SLAVE.
Please refer to this scenario -
1)
Create a Master cluster with wal_level=logical and create a logical
replication slot -
SELECT * FROM pg_create_logical_replication_slot('master_slot',
'test_decoding');
2)
Create a Standby cluster using pg_basebackup ( ./pg_basebackup -D slave/ -v
-R) and create a logical replication slot -
SELECT * FROM pg_create_logical_replication_slot('standby_slot',
'test_decoding');
So, if I understand correctly, you do *not* have a physical replication
slot for this standby? For the feature to work reliably that needs to
exist, and you need to have hot_standby_feedback enabled. Does having
that fix the issue?
Thanks,
Andres
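For reference, the setup Andres describes amounts to something like the
following sketch (the slot name matches the one used in the TAP test
attached below; primary_slot_name on the standby must additionally point
at that slot):
-- On the primary: reserve a physical replication slot for the standby.
SELECT * FROM pg_create_physical_replication_slot('decoding_standby');

-- On the standby: enable hot standby feedback so the primary honours the
-- standby's xmin/catalog_xmin.  The setting is reloadable, so no restart
-- is needed.
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();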
On Fri, 14 Dec 2018 at 06:25, Andres Freund <andres@anarazel.de> wrote:
I've a prototype attached, but let's discuss the details in a separate
thread. This also needs to be changed for pluggable storage, as we don't
know about table access methods in the startup process, so we can't
determine which AM the heap is from during
btree_xlog_delete_get_latestRemovedXid() (and sibling routines).
Attached is a WIP test patch
0003-WIP-TAP-test-for-logical-decoding-on-standby.patch that has a
modified version of Craig Ringer's test cases
(012_logical_decoding_on_replica.pl) that he had attached in [1].
Here, I have also attached his original file
(Craigs_012_logical_decoding_on_replica.pl).
Also attached are rebased versions of a couple of Andres's implementation patches.
I have added a new test scenario:
DROP TABLE on the master *before* the logical records of the table
insertions are retrieved from the standby. The logical records should
still be retrieved successfully.
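In psql terms the scenario boils down to roughly the following (a sketch
mirroring the attached TAP test, where 'standby_logical' is the slot
created on the standby):
-- On the primary:
CREATE TABLE test_table(id serial PRIMARY KEY, blah text);
INSERT INTO test_table(blah) VALUES ('itworks');
DROP TABLE test_table;
VACUUM;

-- On the standby, once replay has caught up, the insert should still be
-- decodable even though the table has been dropped:
SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL);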
Regarding the test result failures, I could see that when we drop a
logical replication slot at the standby server, the catalog_xmin of the
physical replication slot becomes NULL, whereas the test expects it to
be equal to xmin; that's the reason a couple of test scenarios are
failing:
ok 33 - slot on standby dropped manually
Waiting for replication conn replica's replay_lsn to pass '0/31273E0' on master
done
not ok 34 - physical catalog_xmin still non-null
not ok 35 - xmin and catalog_xmin equal after slot drop
# Failed test 'xmin and catalog_xmin equal after slot drop'
# at t/016_logical_decoding_on_replica.pl line 272.
# got:
# expected: 2584
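The failing check amounts to inspecting the physical slot on the master
after the standby's logical slot is dropped; a minimal sketch, using the
physical slot name from the test:
-- On the master; the test expects xmin and catalog_xmin to be equal here,
-- but catalog_xmin comes back NULL instead.
SELECT xmin, catalog_xmin
FROM pg_replication_slots
WHERE slot_name = 'decoding_standby';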
Other than the above, there is this test scenario which I had to remove:
#########################################################
# Conflict with recovery: xmin cancels decoding session
#########################################################
#
# Start a transaction on the replica then perform work that should cause a
# recovery conflict with it. We'll check to make sure the client gets
# terminated with recovery conflict.
#
# Temporarily disable hs feedback so we can test recovery conflicts.
# It's fine to continue using a physical slot, the xmin should be
# cleared. We only check hot_standby_feedback when establishing
# a new decoding session so this approach circumvents the safeguards
# in place and forces a conflict.
This test starts pg_recvlogical, and expects it to be terminated due
to recovery conflict because hs feedback is disabled.
But that does not happen; instead, pg_recvlogical does not return. I
am not sure why it does not terminate with Andres's patch; it was
expected to terminate with Craig Ringer's patch.
Further, there are subsequent test scenarios that test pg_recvlogical
with hs_feedback disabled, which I have removed because pg_recvlogical
does not return. I have yet to clearly understand why that happens; I
suspect it is only because hs_feedback is disabled.
Also, the test cases verify pg_controldata's oldestCatalogXmin value,
which is no longer present with Andres's patch, so I removed the
tracking of oldestCatalogXmin.
[1]: /messages/by-id/CAMsr+YEVmBJ=dyLw=+kTihmUnGy5_EW4Mig5T0maieg_Zu=XCg@mail.gmail.com
Thanks
-Amit Khandekar
Attachments:
0001-Logical-decoding-on-standby_rebased.patch
From 52a1ff5616f8eaed18db6fe1e44ab44d65d6ffd3 Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Tue, 26 Feb 2019 11:18:27 +0530
Subject: [PATCH 1/3] Logical decoding on standby
Andres Freund.
---
src/backend/access/gist/gistxlog.c | 3 +-
src/backend/access/hash/hash_xlog.c | 3 +-
src/backend/access/hash/hashinsert.c | 10 +++--
src/backend/access/heap/heapam.c | 23 +++++++---
src/backend/access/heap/vacuumlazy.c | 2 +-
src/backend/access/heap/visibilitymap.c | 2 +-
src/backend/access/nbtree/nbtpage.c | 3 ++
src/backend/access/nbtree/nbtxlog.c | 4 +-
src/backend/access/spgist/spgvacuum.c | 2 +
src/backend/access/spgist/spgxlog.c | 1 +
src/backend/replication/logical/logical.c | 2 +
src/backend/replication/slot.c | 71 +++++++++++++++++++++++++++++++
src/backend/storage/ipc/standby.c | 7 ++-
src/backend/utils/cache/lsyscache.c | 16 +++++++
src/include/access/gistxlog.h | 2 +-
src/include/access/hash_xlog.h | 1 +
src/include/access/heapam_xlog.h | 8 +++-
src/include/access/nbtxlog.h | 2 +
src/include/access/spgxlog.h | 1 +
src/include/replication/slot.h | 2 +
src/include/storage/standby.h | 2 +-
src/include/utils/lsyscache.h | 1 +
22 files changed, 147 insertions(+), 21 deletions(-)
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 408bd53..f86ec7c 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -341,7 +341,8 @@ gistRedoDeleteRecord(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index c6d8726..14456fa 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -1154,7 +1154,8 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
RelFileNode rnode;
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer);
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 970733f..fd75d0e 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
#include "access/hash.h"
#include "access/hash_xlog.h"
+#include "catalog/catalog.h"
#include "miscadmin.h"
#include "utils/rel.h"
#include "storage/lwlock.h"
@@ -24,7 +25,7 @@
#include "storage/predicate.h"
static void _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
- RelFileNode hnode);
+ Relation heapRel);
/*
* _hash_doinsert() -- Handle insertion of a single index tuple.
@@ -137,7 +138,7 @@ restart_insert:
if (IsBufferCleanupOK(buf))
{
- _hash_vacuum_one_page(rel, metabuf, buf, heapRel->rd_node);
+ _hash_vacuum_one_page(rel, metabuf, buf, heapRel);
if (PageGetFreeSpace(page) >= itemsz)
break; /* OK, now we have enough space */
@@ -336,7 +337,7 @@ _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
static void
_hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
- RelFileNode hnode)
+ Relation heapRel)
{
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable = 0;
@@ -393,7 +394,8 @@ _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
xl_hash_vacuum_one_page xlrec;
XLogRecPtr recptr;
- xlrec.hnode = hnode;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(heapRel);
+ xlrec.hnode = heapRel->rd_node;
xlrec.ntuples = ndeletable;
XLogBeginInsert();
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index dc34993..982fdc7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7184,12 +7184,13 @@ HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
* see comments for vacuum_log_cleanup_info().
*/
XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
+log_heap_cleanup_info(Relation rel, TransactionId latestRemovedXid)
{
xl_heap_cleanup_info xlrec;
XLogRecPtr recptr;
- xlrec.node = rnode;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
+ xlrec.node = rel->rd_node;
xlrec.latestRemovedXid = latestRemovedXid;
XLogBeginInsert();
@@ -7225,6 +7226,7 @@ log_heap_clean(Relation reln, Buffer buffer,
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
xlrec.ndead = ndead;
@@ -7275,6 +7277,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.cutoff_xid = cutoff_xid;
xlrec.ntuples = ntuples;
@@ -7305,7 +7308,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
* heap_buffer, if necessary.
*/
XLogRecPtr
-log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
+log_heap_visible(Relation rel, Buffer heap_buffer, Buffer vm_buffer,
TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
@@ -7315,6 +7318,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(heap_buffer));
Assert(BufferIsValid(vm_buffer));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec.cutoff_xid = cutoff_xid;
xlrec.flags = vmflags;
XLogBeginInsert();
@@ -7735,7 +7739,8 @@ heap_xlog_cleanup_info(XLogReaderState *record)
xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, xlrec->node);
/*
* Actual operation is a no-op. Record type exists to provide a means for
@@ -7771,7 +7776,8 @@ heap_xlog_clean(XLogReaderState *record)
* latestRemovedXid is invalid, skip conflict processing.
*/
if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
/*
* If we have a full-page image, restore it (using a cleanup lock) and
@@ -7867,7 +7873,9 @@ heap_xlog_visible(XLogReaderState *record)
* rather than killing the transaction outright.
*/
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid,
+ xlrec->onCatalogTable,
+ rnode);
/*
* Read the heap page, if it still exists. If the heap file has dropped or
@@ -8004,7 +8012,8 @@ heap_xlog_freeze_page(XLogReaderState *record)
TransactionIdRetreat(latestRemovedXid);
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9416c31..affc8d2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -449,7 +449,7 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
* No need to write the record at all unless it contains a valid value
*/
if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
- (void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+ (void) log_heap_cleanup_info(rel, vacrelstats->latestRemovedXid);
}
/*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06..c5fdd64 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -281,7 +281,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
if (XLogRecPtrIsInvalid(recptr))
{
Assert(!InRecovery);
- recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
+ recptr = log_heap_visible(rel, heapBuf, vmBuf,
cutoff_xid, flags);
/*
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 9c785bc..674b3f1 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -31,6 +31,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/lsyscache.h"
#include "utils/snapmgr.h"
static void _bt_cachemetadata(Relation rel, BTMetaPageData *metad);
@@ -704,6 +705,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
@@ -1065,6 +1067,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
XLogRecPtr recptr;
xl_btree_delete xlrec_delete;
+ xlrec_delete.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_delete.hnode = heapRel->rd_node;
xlrec_delete.nitems = nitems;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index b0666b4..30f2e62 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -698,7 +698,8 @@ btree_xlog_delete(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
/*
@@ -982,6 +983,7 @@ btree_xlog_reuse_page(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable,
xlrec->node);
}
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index b9311ce..ef4910f 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -27,6 +27,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
+#include "utils/lsyscache.h"
/* Entry in pending-list of TIDs we need to revisit */
@@ -502,6 +503,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
OffsetNumber itemnos[MaxIndexTuplesPerPage];
spgxlogVacuumRedirect xlrec;
+ xlrec.onCatalogTable = get_rel_logical_catalog(index->rd_index->indrelid);
xlrec.nToPlaceholder = 0;
xlrec.newestRedirectXid = InvalidTransactionId;
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index 71836ee..c66137a 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -913,6 +913,7 @@ spgRedoVacuumRedirect(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &node, NULL, NULL);
ResolveRecoveryConflictWithSnapshot(xldata->newestRedirectXid,
+ xldata->onCatalogTable,
node);
}
}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 6e5bc12..e8b7af4 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -94,6 +94,7 @@ CheckLogicalDecodingRequirements(void)
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires a database connection")));
+#ifdef NOT_ANYMORE
/* ----
* TODO: We got to change that someday soon...
*
@@ -111,6 +112,7 @@ CheckLogicalDecodingRequirements(void)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("logical decoding cannot be used while in recovery")));
+#endif
}
/*
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 33b23b6..d8104aa 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1064,6 +1064,77 @@ ReplicationSlotReserveWal(void)
}
}
+void
+ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
+{
+ int i;
+ bool found_conflict = false;
+
+ if (max_replication_slots <= 0)
+ return;
+
+restart:
+ if (found_conflict)
+ {
+ CHECK_FOR_INTERRUPTS();
+ /*
+ * Wait awhile for them to die so that we avoid flooding an
+ * unresponsive backend when system is heavily loaded.
+ */
+ pg_usleep(100000);
+ found_conflict = false;
+ }
+
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationSlot *s;
+ NameData slotname;
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;
+
+ s = &ReplicationSlotCtl->replication_slots[i];
+
+ /* cannot change while ReplicationSlotCtlLock is held */
+ if (!s->in_use)
+ continue;
+
+ /* not our database, skip */
+ if (s->data.database != InvalidOid && s->data.database != dboid)
+ continue;
+
+ SpinLockAcquire(&s->mutex);
+ slotname = s->data.name;
+ slot_xmin = s->data.xmin;
+ slot_catalog_xmin = s->data.catalog_xmin;
+ SpinLockRelease(&s->mutex);
+
+ if (TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(WARNING,
+ (errmsg("slot %s w/ xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_xmin, xid)));
+ }
+
+ if (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(WARNING,
+ (errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_catalog_xmin, xid)));
+ }
+ }
+
+ LWLockRelease(ReplicationSlotControlLock);
+
+ if (found_conflict)
+ goto restart;
+}
+
+
/*
* Flush all replication slots to disk.
*
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 4d10e57..f483d53 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -23,6 +23,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/slot.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -291,7 +292,8 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
}
void
-ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode node)
+ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
+ bool onCatalogTable, RelFileNode node)
{
VirtualTransactionId *backends;
@@ -312,6 +314,9 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode
ResolveRecoveryConflictWithVirtualXIDs(backends,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+
+ if (onCatalogTable)
+ ResolveRecoveryConflictWithSlots(node.dbNode, latestRemovedXid);
}
void
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index e88c45d..2441737 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -18,7 +18,9 @@
#include "access/hash.h"
#include "access/htup_details.h"
#include "access/nbtree.h"
+#include "access/table.h"
#include "bootstrap/bootstrap.h"
+#include "catalog/catalog.h"
#include "catalog/namespace.h"
#include "catalog/pg_am.h"
#include "catalog/pg_amop.h"
@@ -1847,6 +1849,20 @@ get_rel_persistence(Oid relid)
return result;
}
+bool
+get_rel_logical_catalog(Oid relid)
+{
+ bool res;
+ Relation rel;
+
+ /* assume previously locked */
+ rel = heap_open(relid, NoLock);
+ res = RelationIsAccessibleInLogicalDecoding(rel);
+ heap_close(rel, NoLock);
+
+ return res;
+}
+
/* ---------- TRANSFORM CACHE ---------- */
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 5117aab..71b1aa7 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -46,10 +46,10 @@ typedef struct gistxlogPageUpdate
*/
typedef struct gistxlogDelete
{
+ bool onCatalogTable;
RelFileNode hnode; /* RelFileNode of the heap the index currently
* points at */
uint16 ntodelete; /* number of deleted offsets */
-
/*
* In payload of blk 0 : todelete OffsetNumbers
*/
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 9cef1b7..455d701 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -263,6 +263,7 @@ typedef struct xl_hash_init_bitmap_page
*/
typedef struct xl_hash_vacuum_one_page
{
+ bool onCatalogTable;
RelFileNode hnode;
int ntuples;
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 22cd13c..482c874 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -237,6 +237,7 @@ typedef struct xl_heap_update
*/
typedef struct xl_heap_clean
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 nredirected;
uint16 ndead;
@@ -252,6 +253,7 @@ typedef struct xl_heap_clean
*/
typedef struct xl_heap_cleanup_info
{
+ bool onCatalogTable;
RelFileNode node;
TransactionId latestRemovedXid;
} xl_heap_cleanup_info;
@@ -332,6 +334,7 @@ typedef struct xl_heap_freeze_tuple
*/
typedef struct xl_heap_freeze_page
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint16 ntuples;
} xl_heap_freeze_page;
@@ -346,6 +349,7 @@ typedef struct xl_heap_freeze_page
*/
typedef struct xl_heap_visible
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint8 flags;
} xl_heap_visible;
@@ -395,7 +399,7 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
extern const char *heap2_identify(uint8 info);
extern void heap_xlog_logical_rewrite(XLogReaderState *r);
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
+extern XLogRecPtr log_heap_cleanup_info(Relation rel,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
@@ -414,7 +418,7 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
bool *totally_frozen);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
-extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
+extern XLogRecPtr log_heap_visible(Relation rel, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index a605851..23f950f 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -123,6 +123,7 @@ typedef struct xl_btree_split
*/
typedef struct xl_btree_delete
{
+ bool onCatalogTable;
RelFileNode hnode; /* RelFileNode of the heap the index currently
* points at */
int nitems;
@@ -137,6 +138,7 @@ typedef struct xl_btree_delete
*/
typedef struct xl_btree_reuse_page
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/spgxlog.h b/src/include/access/spgxlog.h
index 6527fc9..50f334a 100644
--- a/src/include/access/spgxlog.h
+++ b/src/include/access/spgxlog.h
@@ -237,6 +237,7 @@ typedef struct spgxlogVacuumRoot
typedef struct spgxlogVacuumRedirect
{
+ bool onCatalogTable;
uint16 nToPlaceholder; /* number of redirects to make placeholders */
OffsetNumber firstPlaceholder; /* first placeholder tuple to remove */
TransactionId newestRedirectXid; /* newest XID of removed redirects */
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index a8f1d66..4e0776a 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -205,4 +205,6 @@ extern void CheckPointReplicationSlots(void);
extern void CheckSlotRequirements(void);
+extern void ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid);
+
#endif /* SLOT_H */
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 346a310..27c09d1 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -28,7 +28,7 @@ extern void InitRecoveryTransactionEnvironment(void);
extern void ShutdownRecoveryTransactionEnvironment(void);
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
- RelFileNode node);
+ bool catalogTable, RelFileNode node);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index 16b0b1d..3337d7d 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -129,6 +129,7 @@ extern char get_rel_relkind(Oid relid);
extern bool get_rel_relispartition(Oid relid);
extern Oid get_rel_tablespace(Oid relid);
extern char get_rel_persistence(Oid relid);
+extern bool get_rel_logical_catalog(Oid relid);
extern Oid get_transform_fromsql(Oid typid, Oid langid, List *trftypes);
extern Oid get_transform_tosql(Oid typid, Oid langid, List *trftypes);
extern bool get_typisdefined(Oid typid);
--
2.1.4
0002-Move-latestRemovedXid-computation-for-nbtree-xlog-rebased.patch
From ca7a7c51f8b4302932d805f477b7f134bda40a9d Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Tue, 26 Feb 2019 11:36:29 +0530
Subject: [PATCH 2/3] Move latestRemovedXid computation for nbtree xlog record
to primary.
Andres Freund.
---
src/backend/access/hash/hash_xlog.c | 153 +---------------------------------
src/backend/access/hash/hashinsert.c | 19 +++--
src/backend/access/heap/heapam.c | 129 +++++++++++++++++++++++++++++
src/backend/access/index/genam.c | 36 ++++++++
src/backend/access/nbtree/nbtpage.c | 7 ++
src/backend/access/nbtree/nbtxlog.c | 156 +----------------------------------
src/include/access/genam.h | 5 ++
src/include/access/hash_xlog.h | 1 +
src/include/access/heapam.h | 4 +
src/include/access/nbtxlog.h | 1 +
10 files changed, 197 insertions(+), 314 deletions(-)
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 14456fa..3af5050 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -970,155 +970,6 @@ hash_xlog_update_meta_page(XLogReaderState *record)
}
/*
- * Get the latestRemovedXid from the heap pages pointed at by the index
- * tuples being deleted. See also btree_xlog_delete_get_latestRemovedXid,
- * on which this function is based.
- */
-static TransactionId
-hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record)
-{
- xl_hash_vacuum_one_page *xlrec;
- OffsetNumber *unused;
- Buffer ibuffer,
- hbuffer;
- Page ipage,
- hpage;
- RelFileNode rnode;
- BlockNumber blkno;
- ItemId iitemid,
- hitemid;
- IndexTuple itup;
- HeapTupleHeader htuphdr;
- BlockNumber hblkno;
- OffsetNumber hoffnum;
- TransactionId latestRemovedXid = InvalidTransactionId;
- int i;
-
- xlrec = (xl_hash_vacuum_one_page *) XLogRecGetData(record);
-
- /*
- * If there's nothing running on the standby we don't need to derive a
- * full latestRemovedXid value, so use a fast path out of here. This
- * returns InvalidTransactionId, and so will conflict with all HS
- * transactions; but since we just worked out that that's zero people,
- * it's OK.
- *
- * XXX There is a race condition here, which is that a new backend might
- * start just after we look. If so, it cannot need to conflict, but this
- * coding will result in throwing a conflict anyway.
- */
- if (CountDBBackends(InvalidOid) == 0)
- return latestRemovedXid;
-
- /*
- * Check if WAL replay has reached a consistent database state. If not, we
- * must PANIC. See the definition of
- * btree_xlog_delete_get_latestRemovedXid for more details.
- */
- if (!reachedConsistency)
- elog(PANIC, "hash_xlog_vacuum_get_latestRemovedXid: cannot operate with inconsistent data");
-
- /*
- * Get index page. If the DB is consistent, this should not fail, nor
- * should any of the heap page fetches below. If one does, we return
- * InvalidTransactionId to cancel all HS transactions. That's probably
- * overkill, but it's safe, and certainly better than panicking here.
- */
- XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
- ibuffer = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, RBM_NORMAL);
-
- if (!BufferIsValid(ibuffer))
- return InvalidTransactionId;
- LockBuffer(ibuffer, HASH_READ);
- ipage = (Page) BufferGetPage(ibuffer);
-
- /*
- * Loop through the deleted index items to obtain the TransactionId from
- * the heap items they point to.
- */
- unused = (OffsetNumber *) ((char *) xlrec + SizeOfHashVacuumOnePage);
-
- for (i = 0; i < xlrec->ntuples; i++)
- {
- /*
- * Identify the index tuple about to be deleted.
- */
- iitemid = PageGetItemId(ipage, unused[i]);
- itup = (IndexTuple) PageGetItem(ipage, iitemid);
-
- /*
- * Locate the heap page that the index tuple points at
- */
- hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
- hbuffer = XLogReadBufferExtended(xlrec->hnode, MAIN_FORKNUM,
- hblkno, RBM_NORMAL);
-
- if (!BufferIsValid(hbuffer))
- {
- UnlockReleaseBuffer(ibuffer);
- return InvalidTransactionId;
- }
- LockBuffer(hbuffer, HASH_READ);
- hpage = (Page) BufferGetPage(hbuffer);
-
- /*
- * Look up the heap tuple header that the index tuple points at by
- * using the heap node supplied with the xlrec. We can't use
- * heap_fetch, since it uses ReadBuffer rather than XLogReadBuffer.
- * Note that we are not looking at tuple data here, just headers.
- */
- hoffnum = ItemPointerGetOffsetNumber(&(itup->t_tid));
- hitemid = PageGetItemId(hpage, hoffnum);
-
- /*
- * Follow any redirections until we find something useful.
- */
- while (ItemIdIsRedirected(hitemid))
- {
- hoffnum = ItemIdGetRedirect(hitemid);
- hitemid = PageGetItemId(hpage, hoffnum);
- CHECK_FOR_INTERRUPTS();
- }
-
- /*
- * If the heap item has storage, then read the header and use that to
- * set latestRemovedXid.
- *
- * Some LP_DEAD items may not be accessible, so we ignore them.
- */
- if (ItemIdHasStorage(hitemid))
- {
- htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid);
- HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid);
- }
- else if (ItemIdIsDead(hitemid))
- {
- /*
- * Conjecture: if hitemid is dead then it had xids before the xids
- * marked on LP_NORMAL items. So we just ignore this item and move
- * onto the next, for the purposes of calculating
- * latestRemovedxids.
- */
- }
- else
- Assert(!ItemIdIsUsed(hitemid));
-
- UnlockReleaseBuffer(hbuffer);
- }
-
- UnlockReleaseBuffer(ibuffer);
-
- /*
- * If all heap tuples were LP_DEAD then we will be returning
- * InvalidTransactionId here, which avoids conflicts. This matches
- * existing logic which assumes that LP_DEAD tuples must already be older
- * than the latestRemovedXid on the cleanup record that set them as
- * LP_DEAD, hence must already have generated a conflict.
- */
- return latestRemovedXid;
-}
-
-/*
* replay delete operation in hash index to remove
* tuples marked as DEAD during index tuple insertion.
*/
@@ -1149,12 +1000,10 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
*/
if (InHotStandby)
{
- TransactionId latestRemovedXid =
- hash_xlog_vacuum_get_latestRemovedXid(record);
RelFileNode rnode;
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid,
xldata->onCatalogTable, rnode);
}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index fd75d0e..88e2b3d 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -24,8 +24,8 @@
#include "storage/buf_internals.h"
#include "storage/predicate.h"
-static void _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
- Relation heapRel);
+static void _hash_vacuum_one_page(Relation rel, Relation hrel,
+ Buffer metabuf, Buffer buf);
/*
* _hash_doinsert() -- Handle insertion of a single index tuple.
@@ -138,7 +138,7 @@ restart_insert:
if (IsBufferCleanupOK(buf))
{
- _hash_vacuum_one_page(rel, metabuf, buf, heapRel);
+ _hash_vacuum_one_page(rel, heapRel, metabuf, buf);
if (PageGetFreeSpace(page) >= itemsz)
break; /* OK, now we have enough space */
@@ -336,8 +336,8 @@ _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
*/
static void
-_hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
- Relation heapRel)
+_hash_vacuum_one_page(Relation rel, Relation hrel,
+ Buffer metabuf, Buffer buf)
{
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable = 0;
@@ -361,6 +361,10 @@ _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
if (ndeletable > 0)
{
+ TransactionId latestRemovedXid;
+
+ latestRemovedXid = index_compute_xid_horizon_for_tuples(rel, hrel, buf, deletable, ndeletable);
+
/*
* Write-lock the meta page so that we can decrement tuple count.
*/
@@ -394,8 +398,9 @@ _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
xl_hash_vacuum_one_page xlrec;
XLogRecPtr recptr;
- xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(heapRel);
- xlrec.hnode = heapRel->rd_node;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(hrel);
+ xlrec.latestRemovedXid = latestRemovedXid;
+ xlrec.hnode = hrel->rd_node;
xlrec.ntuples = ndeletable;
XLogBeginInsert();
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 982fdc7..c686b80 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7178,6 +7178,135 @@ HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
}
/*
+ * Get the latestRemovedXid from the heap pages pointed at by the index
+ * tuples being deleted.
+ *
+ * This puts the work for calculating latestRemovedXid into the recovery path
+ * rather than the primary path.
+ *
+ * It's possible that this generates a fair amount of I/O, since an index
+ * block may have hundreds of tuples being deleted. Repeat accesses to the
+ * same heap blocks are common, though are not yet optimised.
+ *
+ * XXX optimise later with something like XLogPrefetchBuffer()
+ */
+TransactionId
+heap_compute_xid_horizon_for_tuples(Relation rel,
+ ItemPointerData *tids,
+ int nitems)
+{
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ BlockNumber hblkno;
+ Buffer buf = InvalidBuffer;
+ Page hpage;
+
+ /*
+ * Sort to avoid repeated lookups for the same page, and to make it more
+ * likely to access items in an efficient order. In particular this
+ * ensures that if there are multiple pointers to the same page, they all
+ * get processed looking up and locking the page just once.
+ */
+ qsort((void *) tids, nitems, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) ItemPointerCompare);
+
+ /* prefetch all pages */
+#ifdef USE_PREFETCH
+ hblkno = InvalidBlockNumber;
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemPointer htid = &tids[i];
+
+ if (hblkno == InvalidBlockNumber ||
+ ItemPointerGetBlockNumber(htid) != hblkno)
+ {
+ hblkno = ItemPointerGetBlockNumber(htid);
+
+ PrefetchBuffer(rel, MAIN_FORKNUM, hblkno);
+ }
+ }
+#endif
+
+ /* Iterate over all tids, and check their horizon */
+ hblkno = InvalidBlockNumber;
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemPointer htid = &tids[i];
+ ItemId hitemid;
+ OffsetNumber hoffnum;
+
+ /*
+ * Read heap buffer, but avoid refetching if it's the same block as
+ * required for the last tid.
+ */
+ if (hblkno == InvalidBlockNumber ||
+ ItemPointerGetBlockNumber(htid) != hblkno)
+ {
+ /* release old buffer */
+ if (BufferIsValid(buf))
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buf);
+ }
+
+ hblkno = ItemPointerGetBlockNumber(htid);
+
+ buf = ReadBuffer(rel, hblkno);
+ hpage = BufferGetPage(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ }
+
+ hoffnum = ItemPointerGetOffsetNumber(htid);
+ hitemid = PageGetItemId(hpage, hoffnum);
+
+ /*
+ * Follow any redirections until we find something useful.
+ */
+ while (ItemIdIsRedirected(hitemid))
+ {
+ hoffnum = ItemIdGetRedirect(hitemid);
+ hitemid = PageGetItemId(hpage, hoffnum);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ /*
+ * If the heap item has storage, then read the header and use that to
+ * set latestRemovedXid.
+ *
+ * Some LP_DEAD items may not be accessible, so we ignore them.
+ */
+ if (ItemIdHasStorage(hitemid))
+ {
+ HeapTupleHeader htuphdr;
+
+ htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid);
+
+ HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid);
+ }
+ else if (ItemIdIsDead(hitemid))
+ {
+ /*
+ * Conjecture: if hitemid is dead then it had xids before the xids
+ * marked on LP_NORMAL items. So we just ignore this item and move
+ * onto the next, for the purposes of calculating
+ * latestRemovedxids.
+ */
+ }
+ else
+ Assert(!ItemIdIsUsed(hitemid));
+
+ }
+
+ if (BufferIsValid(buf))
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buf);
+ }
+
+ return latestRemovedXid;
+}
+
+/*
* Perform XLogInsert to register a heap cleanup info message. These
* messages are sent once per VACUUM and are required because
* of the phasing of removal operations during a lazy VACUUM.
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index e0a5ea4..c425ebe 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,42 @@ BuildIndexValueDescription(Relation indexRelation,
return buf.data;
}
+/*
+ * Get the latestRemovedXid from the heap pages pointed at by the index
+ * tuples being deleted.
+ */
+TransactionId
+index_compute_xid_horizon_for_tuples(Relation irel,
+ Relation hrel,
+ Buffer ibuf,
+ OffsetNumber *itemnos,
+ int nitems)
+{
+ ItemPointerData *htids = (ItemPointerData *) palloc(sizeof(ItemPointerData) * nitems);
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ Page ipage = BufferGetPage(ibuf);
+ IndexTuple itup;
+
+ /* identify what the index tuples about to be deleted point to */
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemId iitemid;
+
+ iitemid = PageGetItemId(ipage, itemnos[i]);
+ itup = (IndexTuple) PageGetItem(ipage, iitemid);
+
+ ItemPointerCopy(&itup->t_tid, &htids[i]);
+ }
+
+ /* determine the actual xid horizon */
+ latestRemovedXid =
+ heap_compute_xid_horizon_for_tuples(hrel, htids, nitems);
+
+ pfree(htids);
+
+ return latestRemovedXid;
+}
+
/* ----------------------------------------------------------------
* heap-or-index-scan access to system catalogs
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 674b3f1..b917f06 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1034,10 +1034,16 @@ _bt_delitems_delete(Relation rel, Buffer buf,
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ TransactionId latestRemovedXid = InvalidTransactionId;
/* Shouldn't be called unless there's something to do */
Assert(nitems > 0);
+ if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+ latestRemovedXid =
+ index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+ itemnos, nitems);
+
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
@@ -1068,6 +1074,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
xl_btree_delete xlrec_delete;
xlrec_delete.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
+ xlrec_delete.latestRemovedXid = latestRemovedXid;
xlrec_delete.hnode = heapRel->rd_node;
xlrec_delete.nitems = nitems;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 30f2e62..a8805d1 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -518,159 +518,6 @@ btree_xlog_vacuum(XLogReaderState *record)
UnlockReleaseBuffer(buffer);
}
-/*
- * Get the latestRemovedXid from the heap pages pointed at by the index
- * tuples being deleted. This puts the work for calculating latestRemovedXid
- * into the recovery path rather than the primary path.
- *
- * It's possible that this generates a fair amount of I/O, since an index
- * block may have hundreds of tuples being deleted. Repeat accesses to the
- * same heap blocks are common, though are not yet optimised.
- *
- * XXX optimise later with something like XLogPrefetchBuffer()
- */
-static TransactionId
-btree_xlog_delete_get_latestRemovedXid(XLogReaderState *record)
-{
- xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
- OffsetNumber *unused;
- Buffer ibuffer,
- hbuffer;
- Page ipage,
- hpage;
- RelFileNode rnode;
- BlockNumber blkno;
- ItemId iitemid,
- hitemid;
- IndexTuple itup;
- HeapTupleHeader htuphdr;
- BlockNumber hblkno;
- OffsetNumber hoffnum;
- TransactionId latestRemovedXid = InvalidTransactionId;
- int i;
-
- /*
- * If there's nothing running on the standby we don't need to derive a
- * full latestRemovedXid value, so use a fast path out of here. This
- * returns InvalidTransactionId, and so will conflict with all HS
- * transactions; but since we just worked out that that's zero people,
- * it's OK.
- *
- * XXX There is a race condition here, which is that a new backend might
- * start just after we look. If so, it cannot need to conflict, but this
- * coding will result in throwing a conflict anyway.
- */
- if (CountDBBackends(InvalidOid) == 0)
- return latestRemovedXid;
-
- /*
- * In what follows, we have to examine the previous state of the index
- * page, as well as the heap page(s) it points to. This is only valid if
- * WAL replay has reached a consistent database state; which means that
- * the preceding check is not just an optimization, but is *necessary*. We
- * won't have let in any user sessions before we reach consistency.
- */
- if (!reachedConsistency)
- elog(PANIC, "btree_xlog_delete_get_latestRemovedXid: cannot operate with inconsistent data");
-
- /*
- * Get index page. If the DB is consistent, this should not fail, nor
- * should any of the heap page fetches below. If one does, we return
- * InvalidTransactionId to cancel all HS transactions. That's probably
- * overkill, but it's safe, and certainly better than panicking here.
- */
- XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
- ibuffer = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, RBM_NORMAL);
- if (!BufferIsValid(ibuffer))
- return InvalidTransactionId;
- LockBuffer(ibuffer, BT_READ);
- ipage = (Page) BufferGetPage(ibuffer);
-
- /*
- * Loop through the deleted index items to obtain the TransactionId from
- * the heap items they point to.
- */
- unused = (OffsetNumber *) ((char *) xlrec + SizeOfBtreeDelete);
-
- for (i = 0; i < xlrec->nitems; i++)
- {
- /*
- * Identify the index tuple about to be deleted
- */
- iitemid = PageGetItemId(ipage, unused[i]);
- itup = (IndexTuple) PageGetItem(ipage, iitemid);
-
- /*
- * Locate the heap page that the index tuple points at
- */
- hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
- hbuffer = XLogReadBufferExtended(xlrec->hnode, MAIN_FORKNUM, hblkno, RBM_NORMAL);
- if (!BufferIsValid(hbuffer))
- {
- UnlockReleaseBuffer(ibuffer);
- return InvalidTransactionId;
- }
- LockBuffer(hbuffer, BT_READ);
- hpage = (Page) BufferGetPage(hbuffer);
-
- /*
- * Look up the heap tuple header that the index tuple points at by
- * using the heap node supplied with the xlrec. We can't use
- * heap_fetch, since it uses ReadBuffer rather than XLogReadBuffer.
- * Note that we are not looking at tuple data here, just headers.
- */
- hoffnum = ItemPointerGetOffsetNumber(&(itup->t_tid));
- hitemid = PageGetItemId(hpage, hoffnum);
-
- /*
- * Follow any redirections until we find something useful.
- */
- while (ItemIdIsRedirected(hitemid))
- {
- hoffnum = ItemIdGetRedirect(hitemid);
- hitemid = PageGetItemId(hpage, hoffnum);
- CHECK_FOR_INTERRUPTS();
- }
-
- /*
- * If the heap item has storage, then read the header and use that to
- * set latestRemovedXid.
- *
- * Some LP_DEAD items may not be accessible, so we ignore them.
- */
- if (ItemIdHasStorage(hitemid))
- {
- htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid);
-
- HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid);
- }
- else if (ItemIdIsDead(hitemid))
- {
- /*
- * Conjecture: if hitemid is dead then it had xids before the xids
- * marked on LP_NORMAL items. So we just ignore this item and move
- * onto the next, for the purposes of calculating
- * latestRemovedxids.
- */
- }
- else
- Assert(!ItemIdIsUsed(hitemid));
-
- UnlockReleaseBuffer(hbuffer);
- }
-
- UnlockReleaseBuffer(ibuffer);
-
- /*
- * If all heap tuples were LP_DEAD then we will be returning
- * InvalidTransactionId here, which avoids conflicts. This matches
- * existing logic which assumes that LP_DEAD tuples must already be older
- * than the latestRemovedXid on the cleanup record that set them as
- * LP_DEAD, hence must already have generated a conflict.
- */
- return latestRemovedXid;
-}
-
static void
btree_xlog_delete(XLogReaderState *record)
{
@@ -693,12 +540,11 @@ btree_xlog_delete(XLogReaderState *record)
*/
if (InHotStandby)
{
- TransactionId latestRemovedXid = btree_xlog_delete_get_latestRemovedXid(record);
RelFileNode rnode;
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
xlrec->onCatalogTable, rnode);
}
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index c4aba39..6176079 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -186,6 +186,11 @@ extern IndexScanDesc RelationGetIndexScan(Relation indexRelation,
extern void IndexScanEnd(IndexScanDesc scan);
extern char *BuildIndexValueDescription(Relation indexRelation,
Datum *values, bool *isnull);
+extern TransactionId index_compute_xid_horizon_for_tuples(Relation irel,
+ Relation hrel,
+ Buffer ibuf,
+ OffsetNumber *itemnos,
+ int nitems);
/*
* heap-or-index access to system catalogs (in genam.c)
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 455d701..4e3e908 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -264,6 +264,7 @@ typedef struct xl_hash_init_bitmap_page
typedef struct xl_hash_vacuum_one_page
{
bool onCatalogTable;
+ TransactionId latestRemovedXid;
RelFileNode hnode;
int ntuples;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ab08791..2f05b93 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -166,6 +166,10 @@ extern void simple_heap_update(Relation relation, ItemPointer otid,
extern void heap_sync(Relation relation);
extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
+extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel,
+ ItemPointerData *items,
+ int nitems);
+
/* in heap/pruneheap.c */
extern void heap_page_prune_opt(Relation relation, Buffer buffer);
extern int heap_page_prune(Relation relation, Buffer buffer,
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 23f950f..aa5f1e2 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -124,6 +124,7 @@ typedef struct xl_btree_split
typedef struct xl_btree_delete
{
bool onCatalogTable;
+ TransactionId latestRemovedXid;
RelFileNode hnode; /* RelFileNode of the heap the index currently
* points at */
int nitems;
--
2.1.4
0003-WIP-TAP-test-for-logical-decoding-on-standby.patch
From 7f6995a26b15ffd5220536cee023ad7472d7e6cb Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Mon, 4 Mar 2019 12:22:50 +0530
Subject: [PATCH 3/3] New TAP test for logical decoding on standby.
new file: recovery/t/016_logical_decoding_on_replica.pl
Tests originally written by Craig Ringer, with some WIP changes
from Amit Khandekar.
---
.../recovery/t/016_logical_decoding_on_replica.pl | 358 +++++++++++++++++++++
1 file changed, 358 insertions(+)
create mode 100644 src/test/recovery/t/016_logical_decoding_on_replica.pl
diff --git a/src/test/recovery/t/016_logical_decoding_on_replica.pl b/src/test/recovery/t/016_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..8cc029b
--- /dev/null
+++ b/src/test/recovery/t/016_logical_decoding_on_replica.pl
@@ -0,0 +1,358 @@
+# Demonstrate that logical can follow timeline switches.
+#
+# Test logical decoding on a standby.
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 52;
+use RecursiveCopy;
+use File::Copy;
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--slot=decoding_standby');
+
+sub print_phys_xmin
+{
+ my $slot = $node_master->slot('decoding_standby');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# After slot creation, xmins must be null
+is($xmin, '', "xmin null");
+is($catalog_xmin, '', "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+ $node_master, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_replica->append_conf('postgresql.conf',
+ q[primary_slot_name = 'decoding_standby']);
+
+$node_replica->start;
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# with hot_standby_feedback off, xmin and catalog_xmin must still be null
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "xmin null after replica join");
+is($catalog_xmin, '', "catalog_xmin null after replica join");
+
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+sleep(2); # ensure walreceiver feedback sent
+
+# If no slot on standby exists to hold down catalog_xmin it must follow xmin,
+# (which is nextXid when no xacts are running on the standby).
+($xmin, $catalog_xmin) = print_phys_xmin();
+ok($xmin, "xmin not null");
+is($xmin, $catalog_xmin, "xmin and catalog_xmin equal");
+
+# We need catalog_xmin advance to take effect on the master and be replayed
+# on standby.
+$node_master->safe_psql('postgres', 'CHECKPOINT');
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+#
+# This must succeed since we know we have a catalog_xmin reservation. We
+# might've already sent hot standby feedback to advance our physical slot's
+# catalog_xmin but not received the corresponding xlog for the catalog xmin
+# advance, in which case we'll create a slot that isn't usable. The calling
+# application can prevent this by creating a temporary slot on the master to
+# lock in its catalog_xmin. For a truly race-free solution we'd need
+# master-to-standby hot_standby_feedback replies.
+#
+# In this case it won't race because there's no concurrent activity on the
+# master.
+#
+is($node_replica->psql('testdb', qq[SELECT * FROM pg_create_logical_replication_slot('standby_logical', 'test_decoding')]),
+ 0, 'logical slot creation on standby succeeded')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+sub print_logical_xmin
+{
+ my $slot = $node_replica->slot('standby_logical');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
+$node_master->safe_psql('testdb', 'DROP TABLE test_table');
+$node_master->safe_psql('testdb', 'VACUUM');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+# Should show the inserts even when the table is dropped on master
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($stderr, '', 'stderr is empty');
+is($ret, 0, 'replay from slot succeeded')
+ or BAIL_OUT('cannot continue if slot replay fails');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_1();]);
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+for my $i (0 .. 2000)
+{
+ $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_2();]);
+$node_master->safe_psql('testdb', 'VACUUM');
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+cmp_ok($new_logical_catalog_xmin, "==", $logical_catalog_xmin, "logical slot catalog_xmin hasn't advanced before get_changes");
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+isnt($stdout, '', 'replayed some rows');
+
+($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+is($new_logical_xmin, '', "logical xmin null");
+isnt($new_logical_catalog_xmin, '', "logical slot catalog_xmin not null");
+cmp_ok($new_logical_catalog_xmin, ">", $logical_catalog_xmin, "logical slot catalog_xmin advanced after get_changes");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+isnt($new_physical_xmin, '', "physical xmin not null");
+# hot standby feedback should advance phys catalog_xmin now the standby's slot
+# doesn't hold it down as far.
+isnt($new_physical_catalog_xmin, '', "physical catalog_xmin not null");
+cmp_ok($new_physical_catalog_xmin, ">", $physical_catalog_xmin, "physical catalog_xmin advanced");
+
+cmp_ok($new_physical_catalog_xmin, "<=", $new_logical_catalog_xmin, 'upstream physical slot catalog_xmin not past downstream catalog_xmin with hs_feedback on');
+
+#########################################################
+# Upstream catalog retention
+#########################################################
+
+sub test_catalog_xmin_retention()
+{
+ # First burn some xids on the master in another DB, so we push the master's
+ # nextXid ahead.
+ foreach my $i (1 .. 100)
+ {
+ $node_master->safe_psql('postgres', 'SELECT txid_current()');
+ }
+
+ # Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
+ # past our needed xmin. The only way we have visibility into that is to force
+ # a checkpoint.
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
+ foreach my $dbname ('template1', 'postgres', 'testdb', 'template0')
+ {
+ $node_master->safe_psql($dbname, 'VACUUM FREEZE');
+ }
+ sleep(1);
+ $node_master->safe_psql('postgres', 'CHECKPOINT');
+ IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
+ or die "pg_controldata failed with $?";
+ my @checkpoint = split('\n', $stdout);
+ my ($oldestXid, $oldestCatalogXmin, $nextXid) = ('', '', '');
+ foreach my $line (@checkpoint)
+ {
+ if ($line =~ qr/^Latest checkpoint's NextXID:\s+\d+:(\d+)/)
+ {
+ $nextXid = $1;
+ }
+ if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
+ {
+ $oldestXid = $1;
+ }
+ if ($line =~ qr/^Latest checkpoint's oldestCatalogXmin:\s*(\d+)/)
+ {
+ $oldestCatalogXmin = $1;
+ }
+ }
+ die 'no oldestXID found in checkpoint' unless $oldestXid;
+
+ my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+ my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+
+ print "upstream oldestXid $oldestXid, oldestCatalogXmin $oldestCatalogXmin, nextXid $nextXid, phys slot catalog_xmin $new_physical_catalog_xmin, downstream catalog_xmin $new_logical_catalog_xmin";
+
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
+
+ return ($oldestXid, $oldestCatalogXmin);
+}
+
+my ($oldestXid, $oldestCatalogXmin) = test_catalog_xmin_retention();
+
+cmp_ok($oldestXid, "<=", $new_logical_catalog_xmin, 'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
+
+##################################################
+# Drop slot
+##################################################
+#
+is($node_replica->safe_psql('postgres', 'SHOW hot_standby_feedback'), 'on', 'hs_feedback is on');
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+
+# Make sure slots on replicas are droppable, and properly clear the upstream's xmin
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_xmin, $new_catalog_xmin) = print_phys_xmin();
+# We're now back to the old behaviour of hot_standby_feedback
+# reporting nextXid for both thresholds
+ok($new_catalog_xmin, "physical catalog_xmin still non-null");
+cmp_ok($new_catalog_xmin, '==', $new_xmin,
+ 'xmin and catalog_xmin equal after slot drop');
+
+
+##################################################
+# Recovery: drop database drops idle slots
+##################################################
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB on the upstream if they're on the right DB, or not dropped if on
+# another DB.
+
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-P', 'test_decoding', '-S', 'dodropslot', '--create-slot'], 'pg_recvlogical created dodropslot');
+# or BAIL_OUT('slot creation failed, subsequent results would be meaningless');
+# TODO : Above, it bails out even when pg_recvlogical is successful, commented out BAIL_OUT
+$node_replica->command_ok(['pg_recvlogical', '-v', '-d', $node_replica->connstr('postgres'), '-P', 'test_decoding', '-S', 'otherslot', '--create-slot'], 'pg_recvlogical created otherslot');
+# or BAIL_OUT('slot creation failed, subsequent results would be meaningless');
+# TODO : Above, it bails out even when pg_recvlogical is successful, commented out BAIL_OUT
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, 'logical', 'slot dodropslot on standby created');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'slot otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+ 'database dropped on standby');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, '', 'slot dodropslot on standby dropped');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby not dropped');
+
+
+##################################################
+# Recovery: drop database drops in-use slots
+##################################################
+
+# This time, have the slot in-use on the downstream DB when we drop it.
+print "Testing dropdb when downstream slot is in-use";
+$node_master->psql('postgres', q[CREATE DATABASE testdb2]);
+
+print "creating slot dodropslot2";
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-P', 'test_decoding', '-S', 'dodropslot2', '--create-slot'],
+ 'pg_recvlogical created slot test_decoding');
+is($node_replica->slot('dodropslot2')->{'slot_type'}, 'logical', 'slot dodropslot2 on standby created');
+
+# make sure the slot is in use
+print "starting pg_recvlogical";
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+sleep(1);
+
+is($node_replica->slot('dodropslot2')->{'active'}, 't', 'slot on standby is active')
+ or BAIL_OUT("slot not active on standby, cannot continue. pg_recvlogical exited with '$stdout', '$stderr'");
+
+# Master doesn't know the replica's slot is busy so dropdb should succeed
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb2]);
+ok(1, 'dropdb finished');
+
+while ($node_replica->slot('dodropslot2')->{'active_pid'})
+{
+ sleep(1);
+ print "waiting for walsender to exit";
+}
+
+print "walsender exited, waiting for pg_recvlogical to exit";
+
+# our client should've terminated in response to the walsender error
+eval {
+ $handle->finish;
+};
+$return = $?;
+if ($return) {
+ is($return, 256, "pg_recvlogical terminated by server");
+ like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict');
+ like($stderr, qr/User was connected to a database that must be dropped./, 'recvlogical recovery conflict db');
+}
+
+is($node_replica->slot('dodropslot2')->{'active_pid'}, '', 'walsender backend exited');
+
+# The slot should be dropped by recovery now
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb2')]), 'f',
+ 'database dropped on standby');
+
+is($node_replica->slot('dodropslot2')->{'slot_type'}, '', 'slot on standby dropped');
--
2.1.4
On 03/01/2019 11:16 PM, Andres Freund wrote:
So, if I understand correctly you do *not* have a physical replication
slot for this standby? For the feature to work reliably that needs to
exist, and you need to have hot_standby_feedback enabled. Does having
that fix the issue?
Ok, this time around I performed it like this -
.) Master cluster: set wal_level=logical and hot_standby_feedback=on in
postgresql.conf, start the server, and create a physical replication slot
postgres=# SELECT * FROM
pg_create_physical_replication_slot('decoding_standby');
slot_name | lsn
------------------+-----
decoding_standby |
(1 row)
.) Perform pg_basebackup using --slot=decoding_standby with option -R,
modify port=5555, and start the server.
.) Connect to the slave and create a logical replication slot:
postgres=# create table t(n int);
ERROR: cannot execute CREATE TABLE in a read-only transaction
postgres=#
postgres=# SELECT * FROM
pg_create_logical_replication_slot('standby_slot', 'test_decoding');
slot_name | lsn
--------------+-----------
standby_slot | 0/2000060
(1 row)
Run pgbench (./pgbench -i -s 10 postgres) against the master and
simultaneously start pg_recvlogical against port=5555 (the slave
cluster), specifying slot=standby_slot:
./pg_recvlogical -d postgres -p 5555 -s 1 -F 1 -v --slot=standby_slot
--start -f -
[centos@centos-cpula bin]$ ./pg_recvlogical -d postgres -p 5555 -s 1 -F
1 -v --slot=standby_slot --start -f -
pg_recvlogical: starting log streaming at 0/0 (slot standby_slot)
pg_recvlogical: streaming initiated
pg_recvlogical: confirming write up to 0/0, flush to 0/0 (slot standby_slot)
pg_recvlogical: confirming write up to 0/30194E8, flush to 0/30194E8
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590
(slot standby_slot)
pg_recvlogical: confirming write up to 0/301D558, flush to 0/301D558
(slot standby_slot)
BEGIN 476
COMMIT 476
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40
(slot standby_slot)
BEGIN 477
COMMIT 477
If we do the same for a logical replication slot created on the
master cluster:
[centos@centos-cpula bin]$ ./pg_recvlogical -d postgres -s 1 -F 1 -v
--slot=master_slot --start -f -
pg_recvlogical: starting log streaming at 0/0 (slot master_slot)
pg_recvlogical: streaming initiated
pg_recvlogical: confirming write up to 0/0, flush to 0/0 (slot master_slot)
table public.pgbench_accounts: INSERT: aid[integer]:65057 bid[integer]:1
abalance[integer]:0 filler[character]:' '
table public.pgbench_accounts: INSERT: aid[integer]:65058 bid[integer]:1
abalance[integer]:0 filler[character]:' '
table public.pgbench_accounts: INSERT: aid[integer]:65059 bid[integer]:1
abalance[integer]:0 filler[character]:' '
table public.pgbench_accounts: INSERT: aid[integer]:65060 bid[integer]:1
abalance[integer]:0 filler[character]:' '
table public.pgbench_accounts: INSERT: aid[integer]:65061 bid[integer]:1
abalance[integer]:0 filler[character]:' '
table public.pgbench_accounts: INSERT: aid[integer]:65062 bid[integer]:1
abalance[integer]:0 filler[character]:' '
table public.pgbench_accounts: INSERT: aid[integer]:65063 bid[integer]:1
abalance[integer]:0 filler[character]:' '
table public.pgbench_accounts: INSERT: aid[integer]:65064 bid[integer]:1
abalance[integer]:0 filler[character]:' '
--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
On 03/04/2019 04:54 PM, tushar wrote:
.)Perform pg_basebackup using --slot=decoding_standby with option -R
. modify port=5555 , start the server
set primary_slot_name = 'decoding_standby' in the postgresql.conf file
of slave.
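In other words, the slave's postgresql.conf ends up containing roughly the
following (slot and parameter names as used earlier in this thread), while
the master only needs wal_level = logical:

primary_slot_name = 'decoding_standby'   # physical slot created on the master
hot_standby_feedback = on                # needed so the master honours the standby's catalog_xmin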
--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
Hi,
On 2019-03-04 16:54:32 +0530, tushar wrote:
On 03/01/2019 11:16 PM, Andres Freund wrote:
So, if I understand correctly you do *not* have a physical replication
slot for this standby? For the feature to work reliably that needs to
exist, and you need to have hot_standby_feedback enabled. Does having
that fix the issue?
Ok, this time around I performed it like this -
.) Master cluster: set wal_level=logical and hot_standby_feedback=on in
postgresql.conf, start the server, and create a physical replication slot
Note that hot_standby_feedback=on needs to be set on a standby, not on
the primary (although it doesn't do any harm there).
Thanks,
Andres
On 03/04/2019 10:57 PM, Andres Freund wrote:
Note that hot_standby_feedback=on needs to be set on a standby, not on
the primary (although it doesn't do any harm there).
Right, this parameter was enabled on both master and slave.
Is someone able to reproduce this issue?
--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
There is another issue, where I am getting an error while executing
"pg_logical_slot_get_changes" on the SLAVE.
Master (running on port=5432) - run "make installcheck" after setting
PATH=<installation/bin:$PATH ) and export PGDATABASE=postgres from
regress/ folder
Slave (running on port=5555) - Connect to regression database and
select pg_logical_slot_get_changes
[centos@mail-arts bin]$ ./psql postgres -p 5555 -f t.sql
You are now connected to database "regression" as user "centos".
slot_name | lsn
-----------+-----------
m61 | 1/D437AD8
(1 row)
psql:t.sql:3: ERROR: could not resolve cmin/cmax of catalog tuple
[centos@mail-arts bin]$ cat t.sql
\c regression
SELECT * from pg_create_logical_replication_slot('m61', 'test_decoding');
select * from pg_logical_slot_get_changes('m61',null,null);
regards,
On 03/04/2019 10:57 PM, Andres Freund wrote:
Hi,
On 2019-03-04 16:54:32 +0530, tushar wrote:
On 03/01/2019 11:16 PM, Andres Freund wrote:
So, if I understand correctly you do *not* have a physical replication
slot for this standby? For the feature to work reliably that needs to
exist, and you need to have hot_standby_feedback enabled. Does having
that fix the issue?
Ok, this time around I performed it like this -
.) Master cluster: set wal_level=logical and hot_standby_feedback=on in
postgresql.conf, start the server, and create a physical replication slot
Note that hot_standby_feedback=on needs to be set on a standby, not on
the primary (although it doesn't do any harm there).
Thanks,
Andres
--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
On Mon, 4 Mar 2019 at 14:09, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Fri, 14 Dec 2018 at 06:25, Andres Freund <andres@anarazel.de> wrote:
I've a prototype attached, but let's discuss the details in a separate
thread. This also needs to be changed for pluggable storage, as we don't
know about table access methods in the startup process, so we can't
determine which AM the heap is from during
btree_xlog_delete_get_latestRemovedXid() (and sibling routines).
Attached is a WIP test patch
0003-WIP-TAP-test-for-logical-decoding-on-standby.patch that has a
modified version of Craig Ringer's test cases
Hi Andres,
I am trying to come up with new testcases to test the recovery
conflict handling. Before that I have some queries :
With Craig Ringer's approach, the way to reproduce the recovery
conflict was, I believe, easy : Do a checkpoint, which will log the
global-catalog-xmin-advance WAL record, due to which the standby -
while replaying the message - may find out that it's a recovery
conflict. But with your approach, the latestRemovedXid is passed only
during specific vacuum-related WAL records, so to reproduce the
recovery conflict error, we need to make sure some specific WAL
records are logged, such as XLOG_BTREE_DELETE. So we need to create a
testcase such that while creating an index tuple, it erases dead
tuples from a page, so that it eventually calls
_bt_vacuum_one_page()=>_bt_delitems_delete(), thus logging a
XLOG_BTREE_DELETE record.
I tried to come up with this reproducible testcase without success.
This seems difficult. Do you have an easier option? Maybe we can use
some other WAL records that allow an easier, more reliable test case
for producing a recovery conflict?
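For illustration, the kind of workload I have been trying looks roughly like
this (the table name and row counts below are made up for the sketch, and I
have not been able to confirm that it reliably reaches _bt_vacuum_one_page()):
fill a btree leaf page, delete most of the rows, let an index scan set the
LP_DEAD hints, and then insert back into the same key range so the full page
has to be cleaned:

-- on the master; assumes autovacuum does not clean the page first
CREATE TABLE btree_conflict(id int PRIMARY KEY, padding text);
INSERT INTO btree_conflict SELECT g, 'x' FROM generate_series(1, 400) g;
-- create dead heap tuples behind the first leaf page's entries
DELETE FROM btree_conflict WHERE id <= 350;
-- an index scan over the deleted range marks those index entries LP_DEAD
SET enable_seqscan = off;
SELECT count(*) FROM btree_conflict WHERE id <= 350;
-- inserting into the same key range should find the leaf page full of
-- LP_DEAD items, run the single-page deletion, and log XLOG_BTREE_DELETE
INSERT INTO btree_conflict SELECT g, 'x' FROM generate_series(1, 350) g;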
Further, with your patch, in ResolveRecoveryConflictWithSlots(), it
just emits a WARNING; so the wal receiver would not make
the backends throw an error; hence the test case won't catch the
error. Is that right?
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Hi,
I am getting a server crash on the standby while executing the
pg_logical_slot_get_changes function; please refer to this scenario:
Master cluster (./initdb -D master)
set wal_level='hot_standby' in master/postgresql.conf file
start the server, connect to a psql terminal and create a physical
replication slot (SELECT * from
pg_create_physical_replication_slot('p1');)
perform pg_basebackup using --slot 'p1' (./pg_basebackup -D slave/ -R
--slot p1 -v)
set wal_level='logical', hot_standby_feedback=on,
primary_slot_name='p1' in slave/postgresql.conf file
start the server, connect to a psql terminal and create a logical
replication slot (SELECT * from
pg_create_logical_replication_slot('t','test_decoding');)
run pgbench (./pgbench -i -s 10 postgres) on the master and select
pg_logical_slot_get_changes on the slave database
postgres=# select * from pg_logical_slot_get_changes('t',null,null);
2019-03-13 20:34:50.274 IST [26817] LOG: starting logical decoding for
slot "t"
2019-03-13 20:34:50.274 IST [26817] DETAIL: Streaming transactions
committing after 0/6C000060, reading WAL from 0/6C000028.
2019-03-13 20:34:50.274 IST [26817] STATEMENT: select * from
pg_logical_slot_get_changes('t',null,null);
2019-03-13 20:34:50.275 IST [26817] LOG: logical decoding found
consistent point at 0/6C000028
2019-03-13 20:34:50.275 IST [26817] DETAIL: There are no running
transactions.
2019-03-13 20:34:50.275 IST [26817] STATEMENT: select * from
pg_logical_slot_get_changes('t',null,null);
TRAP: FailedAssertion("!(data == tupledata + tuplelen)", File:
"decode.c", Line: 977)
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: 2019-03-13
20:34:50.276 IST [26809] LOG: server process (PID 26817) was terminated
by signal 6: Aborted
Stack trace -
(gdb) bt
#0 0x00007f370e673277 in raise () from /lib64/libc.so.6
#1 0x00007f370e674968 in abort () from /lib64/libc.so.6
#2 0x0000000000a30edf in ExceptionalCondition (conditionName=0xc36090
"!(data == tupledata + tuplelen)", errorType=0xc35f5c "FailedAssertion",
fileName=0xc35d70 "decode.c",
lineNumber=977) at assert.c:54
#3 0x0000000000843c6f in DecodeMultiInsert (ctx=0x2ba1ac8,
buf=0x7ffd7a5136d0) at decode.c:977
#4 0x0000000000842b32 in DecodeHeap2Op (ctx=0x2ba1ac8,
buf=0x7ffd7a5136d0) at decode.c:375
#5 0x00000000008424dd in LogicalDecodingProcessRecord (ctx=0x2ba1ac8,
record=0x2ba1d88) at decode.c:125
#6 0x000000000084830d in pg_logical_slot_get_changes_guts
(fcinfo=0x2b95838, confirm=true, binary=false) at logicalfuncs.c:307
#7 0x000000000084846a in pg_logical_slot_get_changes (fcinfo=0x2b95838)
at logicalfuncs.c:376
#8 0x00000000006e5b9f in ExecMakeTableFunctionResult
(setexpr=0x2b93ee8, econtext=0x2b93d98, argContext=0x2b99940,
expectedDesc=0x2b97970, randomAccess=false) at execSRF.c:233
#9 0x00000000006fb738 in FunctionNext (node=0x2b93c80) at
nodeFunctionscan.c:94
#10 0x00000000006e52b1 in ExecScanFetch (node=0x2b93c80,
accessMtd=0x6fb67b <FunctionNext>, recheckMtd=0x6fba77
<FunctionRecheck>) at execScan.c:93
#11 0x00000000006e5326 in ExecScan (node=0x2b93c80, accessMtd=0x6fb67b
<FunctionNext>, recheckMtd=0x6fba77 <FunctionRecheck>) at execScan.c:143
#12 0x00000000006fbac1 in ExecFunctionScan (pstate=0x2b93c80) at
nodeFunctionscan.c:270
#13 0x00000000006e3293 in ExecProcNodeFirst (node=0x2b93c80) at
execProcnode.c:445
#14 0x00000000006d8253 in ExecProcNode (node=0x2b93c80) at
../../../src/include/executor/executor.h:241
#15 0x00000000006daa4e in ExecutePlan (estate=0x2b93a28,
planstate=0x2b93c80, use_parallel_mode=false, operation=CMD_SELECT,
sendTuples=true, numberTuples=0,
direction=ForwardScanDirection, dest=0x2b907e0, execute_once=true)
at execMain.c:1643
#16 0x00000000006d8865 in standard_ExecutorRun (queryDesc=0x2afff28,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:362
#17 0x00000000006d869b in ExecutorRun (queryDesc=0x2afff28,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:306
#18 0x00000000008ccef1 in PortalRunSelect (portal=0x2b36168,
forward=true, count=0, dest=0x2b907e0) at pquery.c:929
#19 0x00000000008ccb90 in PortalRun (portal=0x2b36168,
count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x2b907e0, altdest=0x2b907e0, completionTag=0x7ffd7a513e90 "")
at pquery.c:770
#20 0x00000000008c6b58 in exec_simple_query (query_string=0x2adc1e8
"select * from pg_logical_slot_get_changes('t',null,null);") at
postgres.c:1215
#21 0x00000000008cae88 in PostgresMain (argc=1, argv=0x2b06590,
dbname=0x2b063d0 "postgres", username=0x2ad8da8 "centos") at postgres.c:4256
#22 0x0000000000828464 in BackendRun (port=0x2afe3b0) at postmaster.c:4399
#23 0x0000000000827c42 in BackendStartup (port=0x2afe3b0) at
postmaster.c:4090
#24 0x0000000000824036 in ServerLoop () at postmaster.c:1703
#25 0x00000000008238ec in PostmasterMain (argc=3, argv=0x2ad6d00) at
postmaster.c:1376
#26 0x0000000000748aab in main (argc=3, argv=0x2ad6d00) at main.c:228
(gdb)
regards,
On 03/07/2019 09:03 PM, tushar wrote:
There is another issue, where I am getting an error while executing
"pg_logical_slot_get_changes" on the SLAVE.
Master (running on port=5432) - run "make installcheck" after
setting PATH=<installation/bin:$PATH ) and export
PGDATABASE=postgres from regress/ folder
Slave (running on port=5555) - Connect to regression database and
select pg_logical_slot_get_changes
[centos@mail-arts bin]$ ./psql postgres -p 5555 -f t.sql
You are now connected to database "regression" as user "centos".
slot_name | lsn
-----------+-----------
m61 | 1/D437AD8
(1 row)
psql:t.sql:3: ERROR: could not resolve cmin/cmax of catalog tuple
[centos@mail-arts bin]$ cat t.sql
\c regression
SELECT * from pg_create_logical_replication_slot('m61',
'test_decoding');
select * from pg_logical_slot_get_changes('m61',null,null);
regards,
On 03/04/2019 10:57 PM, Andres Freund wrote:
Hi,
On 2019-03-04 16:54:32 +0530, tushar wrote:
On 03/01/2019 11:16 PM, Andres Freund wrote:
So, if I understand correctly you do *not* have a physical replication
slot for this standby? For the feature to work reliably that needs to
exist, and you need to have hot_standby_feedback enabled. Does having
that fix the issue?
Ok, this time around I performed it like this -
.) Master cluster: set wal_level=logical and hot_standby_feedback=on in
postgresql.conf, start the server, and create a physical
replication slot
Note that hot_standby_feedback=on needs to be set on a standby, not on
the primary (although it doesn't do any harm there).
Thanks,
Andres
--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
On Fri, 8 Mar 2019 at 20:59, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Mon, 4 Mar 2019 at 14:09, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Fri, 14 Dec 2018 at 06:25, Andres Freund <andres@anarazel.de> wrote:
I've a prototype attached, but let's discuss the details in a separate
thread. This also needs to be changed for pluggable storage, as we don't
know about table access methods in the startup process, so we can't
determine which AM the heap is from during
btree_xlog_delete_get_latestRemovedXid() (and sibling routines).
Attached is a WIP test patch
0003-WIP-TAP-test-for-logical-decoding-on-standby.patch that has a
modified version of Craig Ringer's test cases.
Hi Andres,
I am trying to come up with new testcases to test the recovery
conflict handling. Before that I have some queries:
With Craig Ringer's approach, the way to reproduce the recovery
conflict was, I believe, easy : Do a checkpoint, which will log the
global-catalog-xmin-advance WAL record, due to which the standby -
while replaying the message - may find out that it's a recovery
conflict. But with your approach, the latestRemovedXid is passed only
during specific vacuum-related WAL records, so to reproduce the
recovery conflict error, we need to make sure some specific WAL
records are logged, such as XLOG_BTREE_DELETE. So we need to create a
testcase such that while creating an index tuple, it erases dead
tuples from a page, so that it eventually calls
_bt_vacuum_one_page()=>_bt_delitems_delete(), thus logging a
XLOG_BTREE_DELETE record.
I tried to come up with this reproducible testcase without success.
This seems difficult. Do you have an easier option? Maybe we can use
some other WAL records that allow an easier, more reliable test case
for producing a recovery conflict?
I managed to get a recovery conflict by:
1. Setting hot_standby_feedback to off
2. Creating a logical replication slot on standby
3. Creating a table on master, and inserting some data.
4. Running: VACUUM FULL;
This gives WARNING messages in the standby log file.
2019-03-14 14:57:56.833 IST [40076] WARNING: slot decoding_standby w/
catalog xmin 474 conflicts with removed xid 477
2019-03-14 14:57:56.833 IST [40076] CONTEXT: WAL redo at 0/3069E98
for Heap2/CLEAN: remxid 477
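Spelled out, those steps are roughly the following (slot name as in the
WARNING above; the table name and row count are made up for the sketch;
hot_standby_feedback is off on the standby):

-- on the standby
SELECT * FROM pg_create_logical_replication_slot('decoding_standby', 'test_decoding');

-- on the master
CREATE TABLE conflict_test(x integer, y text);
INSERT INTO conflict_test SELECT g, g::text FROM generate_series(1, 1000) g;
-- a database-wide VACUUM FULL also rewrites the system catalogs, removing
-- tuples that the standby slot's catalog_xmin still needs
VACUUM FULL;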
But I did not add such a testcase to the test file, because with the
current patch, it does not do anything with the slot; it just keeps
emitting WARNINGs in the log file, so we can't test this scenario as of
now using the TAP test.
Further, with your patch, in ResolveRecoveryConflictWithSlots(), it
just emits a WARNING; so the wal receiver would not make
the backends throw an error; hence the test case won't catch the
error. Is that right?
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Thu, 14 Mar 2019 at 15:00, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
I managed to get a recovery conflict by :
1. Setting hot_standby_feedback to off
2. Creating a logical replication slot on standby
3. Creating a table on master, and insert some data.
4. Running: VACUUM FULL;
This gives WARNING messages in the standby log file.
2019-03-14 14:57:56.833 IST [40076] WARNING: slot decoding_standby w/
catalog xmin 474 conflicts with removed xid 477
2019-03-14 14:57:56.833 IST [40076] CONTEXT: WAL redo at 0/3069E98
for Heap2/CLEAN: remxid 477
But I did not add such a testcase into the test file, because with the
current patch, it does not do anything with the slot; it just keeps on
emitting WARNING in the log file; so we can't test this scenario as of
now using the tap test.
I am going ahead with drop-the-slot way of handling the recovery
conflict. I am trying out using ReplicationSlotDropPtr() to drop the
slot. It seems the required locks are already in place inside the for
loop of ResolveRecoveryConflictWithSlots(), so we can directly call
ReplicationSlotDropPtr() when the slot xmin conflict is found.
As explained above, the only way I could reproduce the conflict is by
turning hot_standby_feedback off on the slave, creating and inserting into
a table on the master and then running VACUUM FULL. But after doing this,
I am not able to verify whether the slot is dropped, because on the slave,
any simple psql command thereafter waits on a lock acquired on a system
catalog, e.g. pg_authid. Working on it.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Hi,
On 2019-04-02 15:26:52 +0530, Amit Khandekar wrote:
On Thu, 14 Mar 2019 at 15:00, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
I managed to get a recovery conflict by :
1. Setting hot_standby_feedback to off
2. Creating a logical replication slot on standby
3. Creating a table on master, and insert some data.
4. Running: VACUUM FULL;
This gives WARNING messages in the standby log file.
2019-03-14 14:57:56.833 IST [40076] WARNING: slot decoding_standby w/
catalog xmin 474 conflicts with removed xid 477
2019-03-14 14:57:56.833 IST [40076] CONTEXT: WAL redo at 0/3069E98
for Heap2/CLEAN: remxid 477
But I did not add such a testcase into the test file, because with the
current patch, it does not do anything with the slot; it just keeps on
emitting WARNING in the log file; so we can't test this scenario as of
now using the tap test.
I am going ahead with drop-the-slot way of handling the recovery
conflict. I am trying out using ReplicationSlotDropPtr() to drop the
slot. It seems the required locks are already in place inside the for
loop of ResolveRecoveryConflictWithSlots(), so we can directly call
ReplicationSlotDropPtr() when the slot xmin conflict is found.
Cool.
As explained above, the only way I could reproduce the conflict is by
turning hot_standby_feedback off on slave, creating and inserting into
a table on master and then running VACUUM FULL. But after doing this,
I am not able to verify whether the slot is dropped, because on slave,
any simple psql command thereon, waits on a lock acquired on sys
catache, e.g. pg_authid. Working on it.
I think that indicates a bug somewhere. If replay progressed, it should
have killed the slot, and continued replaying past the VACUUM
FULL. Those symptoms suggest replay is stuck somewhere. I suggest a)
compiling with WAL_DEBUG enabled, and turning on wal_debug=1, b) looking
at a backtrace of the startup process.
Greetings,
Andres Freund
On Tue, 2 Apr 2019 at 21:34, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-04-02 15:26:52 +0530, Amit Khandekar wrote:
On Thu, 14 Mar 2019 at 15:00, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
I managed to get a recovery conflict by :
1. Setting hot_standby_feedback to off
2. Creating a logical replication slot on standby
3. Creating a table on master, and insert some data.
4. Running: VACUUM FULL;
This gives WARNING messages in the standby log file.
2019-03-14 14:57:56.833 IST [40076] WARNING: slot decoding_standby w/
catalog xmin 474 conflicts with removed xid 477
2019-03-14 14:57:56.833 IST [40076] CONTEXT: WAL redo at 0/3069E98
for Heap2/CLEAN: remxid 477
But I did not add such a testcase into the test file, because with the
current patch, it does not do anything with the slot; it just keeps on
emitting WARNING in the log file; so we can't test this scenario as of
now using the tap test.
I am going ahead with drop-the-slot way of handling the recovery
conflict. I am trying out using ReplicationSlotDropPtr() to drop the
slot. It seems the required locks are already in place inside the for
loop of ResolveRecoveryConflictWithSlots(), so we can directly call
ReplicationSlotDropPtr() when the slot xmin conflict is found.
Cool.
As explained above, the only way I could reproduce the conflict is by
turning hot_standby_feedback off on slave, creating and inserting into
a table on master and then running VACUUM FULL. But after doing this,
I am not able to verify whether the slot is dropped, because on slave,
any simple psql command thereon, waits on a lock acquired on sys
catache, e.g. pg_authid. Working on it.
I think that indicates a bug somewhere. If replay progressed, it should
have killed the slot, and continued replaying past the VACUUM
FULL. Those symptoms suggest replay is stuck somewhere. I suggest a)
compiling with WAL_DEBUG enabled, and turning on wal_debug=1, b) looking
at a backtrace of the startup process.
Oops, it was my own change that caused the hang. Sorry for the noise.
After using wal_debug, I found out that after replaying the LOCK records
for the catalog pg_auth, it was not releasing them because it had
actually got stuck in ReplicationSlotDropPtr() itself. In
ResolveRecoveryConflictWithSlots(), a shared
ReplicationSlotControlLock was already held before iterating through
the slots, and ReplicationSlotDropPtr() then tries to take the
same lock in exclusive mode for setting slot->in_use, leading to a
deadlock. I fixed that by releasing the shared lock before calling
ReplicationSlotDropPtr(), and then restarting the scan over the slots
since we released the lock. We do a similar thing for
ReplicationSlotCleanup().
Attached is a rebased version of your patch
logical-decoding-on-standby.patch. This v2 version also has the above
changes. It also includes the TAP test file, which is still in a WIP
state, mainly because I have yet to add the conflict recovery handling
scenarios.
I see that you have already committed the
move-latestRemovedXid-computation-for-nbtree-xlog related changes.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
logical-decoding-on-standby_v2.patch (application/octet-stream)
From a508f1c38ff689ec5a8d9df371fd941d547fa479 Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Wed, 3 Apr 2019 19:26:49 +0530
Subject: [PATCH] Logical decoding on standby.
-Andres Freund.
Besides the above main changes by Andres, the following changes were done by
Amit Khandekar:
1. Handle slot conflict recovery by dropping the conflicting slots.
2. test/recovery/t/016_logical_decoding_on_replica.pl added.
This test was originally written by Craig Ringer, with some changes
from Amit Khandekar. Still in WIP state. Yet to add scenarios to test
conflict recovery.
---
src/backend/access/gist/gistxlog.c | 6 +-
src/backend/access/hash/hash_xlog.c | 3 +-
src/backend/access/hash/hashinsert.c | 2 +
src/backend/access/heap/heapam.c | 23 +-
src/backend/access/heap/vacuumlazy.c | 2 +-
src/backend/access/heap/visibilitymap.c | 2 +-
src/backend/access/nbtree/nbtpage.c | 3 +
src/backend/access/nbtree/nbtxlog.c | 4 +-
src/backend/access/spgist/spgvacuum.c | 2 +
src/backend/access/spgist/spgxlog.c | 1 +
src/backend/replication/logical/logical.c | 2 +
src/backend/replication/slot.c | 79 +++++
src/backend/storage/ipc/standby.c | 7 +-
src/backend/utils/cache/lsyscache.c | 16 +
src/include/access/gistxlog.h | 3 +-
src/include/access/hash_xlog.h | 1 +
src/include/access/heapam_xlog.h | 8 +-
src/include/access/nbtxlog.h | 2 +
src/include/access/spgxlog.h | 1 +
src/include/replication/slot.h | 2 +
src/include/storage/standby.h | 2 +-
src/include/utils/lsyscache.h | 1 +
src/include/utils/rel.h | 1 +
.../recovery/t/016_logical_decoding_on_replica.pl | 358 +++++++++++++++++++++
24 files changed, 513 insertions(+), 18 deletions(-)
create mode 100644 src/test/recovery/t/016_logical_decoding_on_replica.pl
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index cb80ab0..ccb761f 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -342,7 +342,8 @@ gistRedoDeleteRecord(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
@@ -563,7 +564,7 @@ gistRedoPageReuse(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
- xlrec->node);
+ xlrec->onCatalogTable, xlrec->node);
}
}
@@ -758,6 +759,7 @@ gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXi
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index d7b7098..00c3e0f 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -1002,7 +1002,8 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
RelFileNode rnode;
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer);
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index e17f017..b67e4e6 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
#include "access/hash.h"
#include "access/hash_xlog.h"
+#include "catalog/catalog.h"
#include "miscadmin.h"
#include "utils/rel.h"
#include "storage/lwlock.h"
@@ -398,6 +399,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
xl_hash_vacuum_one_page xlrec;
XLogRecPtr recptr;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(hrel);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.ntuples = ndeletable;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 05ceb65..f5439d2 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7097,12 +7097,13 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
* see comments for vacuum_log_cleanup_info().
*/
XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
+log_heap_cleanup_info(Relation rel, TransactionId latestRemovedXid)
{
xl_heap_cleanup_info xlrec;
XLogRecPtr recptr;
- xlrec.node = rnode;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
+ xlrec.node = rel->rd_node;
xlrec.latestRemovedXid = latestRemovedXid;
XLogBeginInsert();
@@ -7138,6 +7139,7 @@ log_heap_clean(Relation reln, Buffer buffer,
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
xlrec.ndead = ndead;
@@ -7188,6 +7190,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.cutoff_xid = cutoff_xid;
xlrec.ntuples = ntuples;
@@ -7218,7 +7221,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
* heap_buffer, if necessary.
*/
XLogRecPtr
-log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
+log_heap_visible(Relation rel, Buffer heap_buffer, Buffer vm_buffer,
TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
@@ -7228,6 +7231,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(heap_buffer));
Assert(BufferIsValid(vm_buffer));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec.cutoff_xid = cutoff_xid;
xlrec.flags = vmflags;
XLogBeginInsert();
@@ -7648,7 +7652,8 @@ heap_xlog_cleanup_info(XLogReaderState *record)
xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, xlrec->node);
/*
* Actual operation is a no-op. Record type exists to provide a means for
@@ -7684,7 +7689,8 @@ heap_xlog_clean(XLogReaderState *record)
* latestRemovedXid is invalid, skip conflict processing.
*/
if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
/*
* If we have a full-page image, restore it (using a cleanup lock) and
@@ -7780,7 +7786,9 @@ heap_xlog_visible(XLogReaderState *record)
* rather than killing the transaction outright.
*/
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid,
+ xlrec->onCatalogTable,
+ rnode);
/*
* Read the heap page, if it still exists. If the heap file has dropped or
@@ -7917,7 +7925,8 @@ heap_xlog_freeze_page(XLogReaderState *record)
TransactionIdRetreat(latestRemovedXid);
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 392b35e..6959119 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -465,7 +465,7 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
* No need to write the record at all unless it contains a valid value
*/
if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
- (void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+ (void) log_heap_cleanup_info(rel, vacrelstats->latestRemovedXid);
}
/*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06..c5fdd64 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -281,7 +281,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
if (XLogRecPtrIsInvalid(recptr))
{
Assert(!InRecovery);
- recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
+ recptr = log_heap_visible(rel, heapBuf, vmBuf,
cutoff_xid, flags);
/*
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 8ade165..745cbc5 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -31,6 +31,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/lsyscache.h"
#include "utils/snapmgr.h"
static void _bt_cachemetadata(Relation rel, BTMetaPageData *input);
@@ -773,6 +774,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
@@ -1140,6 +1142,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
XLogRecPtr recptr;
xl_btree_delete xlrec_delete;
+ xlrec_delete.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_delete.latestRemovedXid = latestRemovedXid;
xlrec_delete.nitems = nitems;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 0a85d8b..2617d55 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -526,7 +526,8 @@ btree_xlog_delete(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
/*
@@ -810,6 +811,7 @@ btree_xlog_reuse_page(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable,
xlrec->node);
}
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index b9311ce..ef4910f 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -27,6 +27,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
+#include "utils/lsyscache.h"
/* Entry in pending-list of TIDs we need to revisit */
@@ -502,6 +503,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
OffsetNumber itemnos[MaxIndexTuplesPerPage];
spgxlogVacuumRedirect xlrec;
+ xlrec.onCatalogTable = get_rel_logical_catalog(index->rd_index->indrelid);
xlrec.nToPlaceholder = 0;
xlrec.newestRedirectXid = InvalidTransactionId;
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index 71836ee..c66137a 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -913,6 +913,7 @@ spgRedoVacuumRedirect(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &node, NULL, NULL);
ResolveRecoveryConflictWithSnapshot(xldata->newestRedirectXid,
+ xldata->onCatalogTable,
node);
}
}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 6e5bc12..e8b7af4 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -94,6 +94,7 @@ CheckLogicalDecodingRequirements(void)
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires a database connection")));
+#ifdef NOT_ANYMORE
/* ----
* TODO: We got to change that someday soon...
*
@@ -111,6 +112,7 @@ CheckLogicalDecodingRequirements(void)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("logical decoding cannot be used while in recovery")));
+#endif
}
/*
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 006446b..5785d2f 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1064,6 +1064,85 @@ ReplicationSlotReserveWal(void)
}
}
+void
+ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
+{
+ int i;
+ bool found_conflict = false;
+
+ if (max_replication_slots <= 0)
+ return;
+
+restart:
+ if (found_conflict)
+ {
+ CHECK_FOR_INTERRUPTS();
+ /*
+ * Wait awhile for them to die so that we avoid flooding an
+ * unresponsive backend when system is heavily loaded.
+ */
+ pg_usleep(100000);
+ found_conflict = false;
+ }
+
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationSlot *s;
+ NameData slotname;
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;
+
+ s = &ReplicationSlotCtl->replication_slots[i];
+
+ /* cannot change while ReplicationSlotCtlLock is held */
+ if (!s->in_use)
+ continue;
+
+ /* not our database, skip */
+ if (s->data.database != InvalidOid && s->data.database != dboid)
+ continue;
+
+ SpinLockAcquire(&s->mutex);
+ slotname = s->data.name;
+ slot_xmin = s->data.xmin;
+ slot_catalog_xmin = s->data.catalog_xmin;
+ SpinLockRelease(&s->mutex);
+
+ if (TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(WARNING,
+ (errmsg("slot %s w/ xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_xmin, xid)));
+ }
+
+ if (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(WARNING,
+ (errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_catalog_xmin, xid)));
+ }
+
+
+ if (found_conflict)
+ {
+ elog(WARNING, "Dropping conflicting slot %s", s->data.name.data);
+ LWLockRelease(ReplicationSlotControlLock); /* avoid deadlock */
+ ReplicationSlotDropPtr(s);
+
+ /* We released the lock above; so re-scan the slots. */
+ goto restart;
+ }
+ }
+
+ LWLockRelease(ReplicationSlotControlLock);
+}
+
+
/*
* Flush all replication slots to disk.
*
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 215f146..75dbdb9 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -23,6 +23,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/slot.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -291,7 +292,8 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
}
void
-ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode node)
+ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
+ bool onCatalogTable, RelFileNode node)
{
VirtualTransactionId *backends;
@@ -312,6 +314,9 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode
ResolveRecoveryConflictWithVirtualXIDs(backends,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+
+ if (onCatalogTable)
+ ResolveRecoveryConflictWithSlots(node.dbNode, latestRemovedXid);
}
void
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index 1089556..92a6ed1 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -18,7 +18,9 @@
#include "access/hash.h"
#include "access/htup_details.h"
#include "access/nbtree.h"
+#include "access/table.h"
#include "bootstrap/bootstrap.h"
+#include "catalog/catalog.h"
#include "catalog/namespace.h"
#include "catalog/pg_am.h"
#include "catalog/pg_amop.h"
@@ -1896,6 +1898,20 @@ get_rel_persistence(Oid relid)
return result;
}
+bool
+get_rel_logical_catalog(Oid relid)
+{
+ bool res;
+ Relation rel;
+
+ /* assume previously locked */
+ rel = heap_open(relid, NoLock);
+ res = RelationIsAccessibleInLogicalDecoding(rel);
+ heap_close(rel, NoLock);
+
+ return res;
+}
+
/* ---------- TRANSFORM CACHE ---------- */
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 2f87b67..5eb0c71 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -47,10 +47,10 @@ typedef struct gistxlogPageUpdate
*/
typedef struct gistxlogDelete
{
+ bool onCatalogTable;
RelFileNode hnode; /* RelFileNode of the heap the index currently
* points at */
uint16 ntodelete; /* number of deleted offsets */
-
/*
* In payload of blk 0 : todelete OffsetNumbers
*/
@@ -95,6 +95,7 @@ typedef struct gistxlogPageDelete
*/
typedef struct gistxlogPageReuse
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 53b682c..fd70b55 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -263,6 +263,7 @@ typedef struct xl_hash_init_bitmap_page
*/
typedef struct xl_hash_vacuum_one_page
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int ntuples;
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 22cd13c..482c874 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -237,6 +237,7 @@ typedef struct xl_heap_update
*/
typedef struct xl_heap_clean
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 nredirected;
uint16 ndead;
@@ -252,6 +253,7 @@ typedef struct xl_heap_clean
*/
typedef struct xl_heap_cleanup_info
{
+ bool onCatalogTable;
RelFileNode node;
TransactionId latestRemovedXid;
} xl_heap_cleanup_info;
@@ -332,6 +334,7 @@ typedef struct xl_heap_freeze_tuple
*/
typedef struct xl_heap_freeze_page
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint16 ntuples;
} xl_heap_freeze_page;
@@ -346,6 +349,7 @@ typedef struct xl_heap_freeze_page
*/
typedef struct xl_heap_visible
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint8 flags;
} xl_heap_visible;
@@ -395,7 +399,7 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
extern const char *heap2_identify(uint8 info);
extern void heap_xlog_logical_rewrite(XLogReaderState *r);
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
+extern XLogRecPtr log_heap_cleanup_info(Relation rel,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
@@ -414,7 +418,7 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
bool *totally_frozen);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
-extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
+extern XLogRecPtr log_heap_visible(Relation rel, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc8..f64a33c 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -126,6 +126,7 @@ typedef struct xl_btree_split
*/
typedef struct xl_btree_delete
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int nitems;
@@ -139,6 +140,7 @@ typedef struct xl_btree_delete
*/
typedef struct xl_btree_reuse_page
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/spgxlog.h b/src/include/access/spgxlog.h
index 6527fc9..50f334a 100644
--- a/src/include/access/spgxlog.h
+++ b/src/include/access/spgxlog.h
@@ -237,6 +237,7 @@ typedef struct spgxlogVacuumRoot
typedef struct spgxlogVacuumRedirect
{
+ bool onCatalogTable;
uint16 nToPlaceholder; /* number of redirects to make placeholders */
OffsetNumber firstPlaceholder; /* first placeholder tuple to remove */
TransactionId newestRedirectXid; /* newest XID of removed redirects */
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index a8f1d66..4e0776a 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -205,4 +205,6 @@ extern void CheckPointReplicationSlots(void);
extern void CheckSlotRequirements(void);
+extern void ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid);
+
#endif /* SLOT_H */
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 2361243..f276c7e 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -28,7 +28,7 @@ extern void InitRecoveryTransactionEnvironment(void);
extern void ShutdownRecoveryTransactionEnvironment(void);
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
- RelFileNode node);
+ bool catalogTable, RelFileNode node);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index 9606d02..78bc639 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -131,6 +131,7 @@ extern char get_rel_relkind(Oid relid);
extern bool get_rel_relispartition(Oid relid);
extern Oid get_rel_tablespace(Oid relid);
extern char get_rel_persistence(Oid relid);
+extern bool get_rel_logical_catalog(Oid relid);
extern Oid get_transform_fromsql(Oid typid, Oid langid, List *trftypes);
extern Oid get_transform_tosql(Oid typid, Oid langid, List *trftypes);
extern bool get_typisdefined(Oid typid);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 5402851..d6437d6 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -16,6 +16,7 @@
#include "access/tupdesc.h"
#include "access/xlog.h"
+#include "catalog/catalog.h"
#include "catalog/pg_class.h"
#include "catalog/pg_index.h"
#include "catalog/pg_publication.h"
diff --git a/src/test/recovery/t/016_logical_decoding_on_replica.pl b/src/test/recovery/t/016_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..8cc029b
--- /dev/null
+++ b/src/test/recovery/t/016_logical_decoding_on_replica.pl
@@ -0,0 +1,358 @@
+# Demonstrate that logical decoding can follow timeline switches.
+#
+# Test logical decoding on a standby.
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 52;
+use RecursiveCopy;
+use File::Copy;
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--slot=decoding_standby');
+
+sub print_phys_xmin
+{
+ my $slot = $node_master->slot('decoding_standby');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# After slot creation, xmins must be null
+is($xmin, '', "xmin null");
+is($catalog_xmin, '', "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+ $node_master, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_replica->append_conf('postgresql.conf',
+ q[primary_slot_name = 'decoding_standby']);
+
+$node_replica->start;
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# with hot_standby_feedback off, xmin and catalog_xmin must still be null
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "xmin null after replica join");
+is($catalog_xmin, '', "catalog_xmin null after replica join");
+
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+sleep(2); # ensure walreceiver feedback sent
+
+# If no slot on standby exists to hold down catalog_xmin it must follow xmin,
+# (which is nextXid when no xacts are running on the standby).
+($xmin, $catalog_xmin) = print_phys_xmin();
+ok($xmin, "xmin not null");
+is($xmin, $catalog_xmin, "xmin and catalog_xmin equal");
+
+# We need catalog_xmin advance to take effect on the master and be replayed
+# on standby.
+$node_master->safe_psql('postgres', 'CHECKPOINT');
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+#
+# This must succeed since we know we have a catalog_xmin reservation. We
+# might've already sent hot standby feedback to advance our physical slot's
+# catalog_xmin but not received the corresponding xlog for the catalog xmin
+# advance, in which case we'll create a slot that isn't usable. The calling
+# application can prevent this by creating a temporary slot on the master to
+# lock in its catalog_xmin. For a truly race-free solution we'd need
+# master-to-standby hot_standby_feedback replies.
+#
+# In this case it won't race because there's no concurrent activity on the
+# master.
+#
+is($node_replica->psql('testdb', qq[SELECT * FROM pg_create_logical_replication_slot('standby_logical', 'test_decoding')]),
+ 0, 'logical slot creation on standby succeeded')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+sub print_logical_xmin
+{
+ my $slot = $node_replica->slot('standby_logical');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
+$node_master->safe_psql('testdb', 'DROP TABLE test_table');
+$node_master->safe_psql('testdb', 'VACUUM');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+# Should show the inserts even when the table is dropped on master
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($stderr, '', 'stderr is empty');
+is($ret, 0, 'replay from slot succeeded')
+ or BAIL_OUT('cannot continue if slot replay fails');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_1();]);
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+for my $i (0 .. 2000)
+{
+ $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_2();]);
+$node_master->safe_psql('testdb', 'VACUUM');
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+cmp_ok($new_logical_catalog_xmin, "==", $logical_catalog_xmin, "logical slot catalog_xmin hasn't advanced before get_changes");
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+isnt($stdout, '', 'replayed some rows');
+
+($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+is($new_logical_xmin, '', "logical xmin null");
+isnt($new_logical_catalog_xmin, '', "logical slot catalog_xmin not null");
+cmp_ok($new_logical_catalog_xmin, ">", $logical_catalog_xmin, "logical slot catalog_xmin advanced after get_changes");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+isnt($new_physical_xmin, '', "physical xmin not null");
+# hot standby feedback should advance phys catalog_xmin now the standby's slot
+# doesn't hold it down as far.
+isnt($new_physical_catalog_xmin, '', "physical catalog_xmin not null");
+cmp_ok($new_physical_catalog_xmin, ">", $physical_catalog_xmin, "physical catalog_xmin advanced");
+
+cmp_ok($new_physical_catalog_xmin, "<=", $new_logical_catalog_xmin, 'upstream physical slot catalog_xmin not past downstream catalog_xmin with hs_feedback on');
+
+#########################################################
+# Upstream catalog retention
+#########################################################
+
+sub test_catalog_xmin_retention()
+{
+ # First burn some xids on the master in another DB, so we push the master's
+ # nextXid ahead.
+ foreach my $i (1 .. 100)
+ {
+ $node_master->safe_psql('postgres', 'SELECT txid_current()');
+ }
+
+ # Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
+ # past our needed xmin. The only way we have visibility into that is to force
+ # a checkpoint.
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
+ foreach my $dbname ('template1', 'postgres', 'testdb', 'template0')
+ {
+ $node_master->safe_psql($dbname, 'VACUUM FREEZE');
+ }
+ sleep(1);
+ $node_master->safe_psql('postgres', 'CHECKPOINT');
+ IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
+ or die "pg_controldata failed with $?";
+ my @checkpoint = split('\n', $stdout);
+ my ($oldestXid, $oldestCatalogXmin, $nextXid) = ('', '', '');
+ foreach my $line (@checkpoint)
+ {
+ if ($line =~ qr/^Latest checkpoint's NextXID:\s+\d+:(\d+)/)
+ {
+ $nextXid = $1;
+ }
+ if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
+ {
+ $oldestXid = $1;
+ }
+ if ($line =~ qr/^Latest checkpoint's oldestCatalogXmin:\s*(\d+)/)
+ {
+ $oldestCatalogXmin = $1;
+ }
+ }
+ die 'no oldestXID found in checkpoint' unless $oldestXid;
+
+ my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+ my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+
+ print "upstream oldestXid $oldestXid, oldestCatalogXmin $oldestCatalogXmin, nextXid $nextXid, phys slot catalog_xmin $new_physical_catalog_xmin, downstream catalog_xmin $new_logical_catalog_xmin";
+
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
+
+ return ($oldestXid, $oldestCatalogXmin);
+}
+
+my ($oldestXid, $oldestCatalogXmin) = test_catalog_xmin_retention();
+
+cmp_ok($oldestXid, "<=", $new_logical_catalog_xmin, 'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
+
+##################################################
+# Drop slot
+##################################################
+#
+is($node_replica->safe_psql('postgres', 'SHOW hot_standby_feedback'), 'on', 'hs_feedback is on');
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+
+# Make sure slots on replicas are droppable, and properly clear the upstream's xmin
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_xmin, $new_catalog_xmin) = print_phys_xmin();
+# We're now back to the old behaviour of hot_standby_feedback
+# reporting nextXid for both thresholds
+ok($new_catalog_xmin, "physical catalog_xmin still non-null");
+cmp_ok($new_catalog_xmin, '==', $new_xmin,
+ 'xmin and catalog_xmin equal after slot drop');
+
+
+##################################################
+# Recovery: drop database drops idle slots
+##################################################
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB on the upstream if they're on the right DB, or not dropped if on
+# another DB.
+
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-P', 'test_decoding', '-S', 'dodropslot', '--create-slot'], 'pg_recvlogical created dodropslot');
+# or BAIL_OUT('slot creation failed, subsequent results would be meaningless');
+# TODO : Above, it bails out even when pg_recvlogical is successful, commented out BAIL_OUT
+$node_replica->command_ok(['pg_recvlogical', '-v', '-d', $node_replica->connstr('postgres'), '-P', 'test_decoding', '-S', 'otherslot', '--create-slot'], 'pg_recvlogical created otherslot');
+# or BAIL_OUT('slot creation failed, subsequent results would be meaningless');
+# TODO : Above, it bails out even when pg_recvlogical is successful, commented out BAIL_OUT
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, 'logical', 'slot dodropslot on standby created');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'slot otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+ 'database dropped on standby');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, '', 'slot dodropslot on standby dropped');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby not dropped');
+
+
+##################################################
+# Recovery: drop database drops in-use slots
+##################################################
+
+# This time, have the slot in-use on the downstream DB when we drop it.
+print "Testing dropdb when downstream slot is in-use";
+$node_master->psql('postgres', q[CREATE DATABASE testdb2]);
+
+print "creating slot dodropslot2";
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-P', 'test_decoding', '-S', 'dodropslot2', '--create-slot'],
+ 'pg_recvlogical created slot test_decoding');
+is($node_replica->slot('dodropslot2')->{'slot_type'}, 'logical', 'slot dodropslot2 on standby created');
+
+# make sure the slot is in use
+print "starting pg_recvlogical";
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+sleep(1);
+
+is($node_replica->slot('dodropslot2')->{'active'}, 't', 'slot on standby is active')
+ or BAIL_OUT("slot not active on standby, cannot continue. pg_recvlogical exited with '$stdout', '$stderr'");
+
+# Master doesn't know the replica's slot is busy so dropdb should succeed
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb2]);
+ok(1, 'dropdb finished');
+
+while ($node_replica->slot('dodropslot2')->{'active_pid'})
+{
+ sleep(1);
+ print "waiting for walsender to exit";
+}
+
+print "walsender exited, waiting for pg_recvlogical to exit";
+
+# our client should've terminated in response to the walsender error
+eval {
+ $handle->finish;
+};
+$return = $?;
+if ($return) {
+ is($return, 256, "pg_recvlogical terminated by server");
+ like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict');
+ like($stderr, qr/User was connected to a database that must be dropped./, 'recvlogical recovery conflict db');
+}
+
+is($node_replica->slot('dodropslot2')->{'active_pid'}, '', 'walsender backend exited');
+
+# The slot should be dropped by recovery now
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb2')]), 'f',
+ 'database dropped on standby');
+
+is($node_replica->slot('dodropslot2')->{'slot_type'}, '', 'slot on standby dropped');
--
2.1.4
On Wed, 3 Apr 2019 at 19:57, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
Oops, it was my own change that caused the hang. Sorry for the noise.
After enabling wal_debug, I found out that after replaying the LOCK records
for the catalog pg_auth, recovery was not releasing the lock because it had
actually got stuck in ReplicationSlotDropPtr() itself. In
ResolveRecoveryConflictWithSlots(), the shared
ReplicationSlotControlLock was already held before iterating through
the slots, and ReplicationSlotDropPtr() then tries to take the
same lock in exclusive mode to update slot->in_use, leading to a
deadlock. I fixed that by releasing the shared lock before calling
ReplicationSlotDropPtr(), and then restarting the scan over the slots,
since we released the lock. We do a similar thing in
ReplicationSlotCleanup().
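To make the flow concrete, here is a condensed, illustrative sketch of that
pattern; the full version is in the attached patch, and slot_conflicts_with()
below is only a placeholder for the xmin/catalog_xmin checks, not a real
function:

void
ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
{
	int		i;

	if (max_replication_slots <= 0)
		return;

restart:
	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
	for (i = 0; i < max_replication_slots; i++)
	{
		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

		if (!s->in_use)
			continue;

		if (slot_conflicts_with(s, dboid, xid))	/* placeholder for the xmin checks */
		{
			/*
			 * ReplicationSlotDropPtr() re-acquires ReplicationSlotControlLock
			 * in exclusive mode, so we must not still hold it in shared mode
			 * here -- that is the self-deadlock described above.
			 */
			LWLockRelease(ReplicationSlotControlLock);
			ReplicationSlotDropPtr(s);
			goto restart;		/* lock released; rescan all slots */
		}
	}
	LWLockRelease(ReplicationSlotControlLock);
}

The real function additionally sleeps briefly between retries and emits
WARNINGs naming the conflicting slot before dropping it.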
Attached is a rebased version of your patch
logical-decoding-on-standby.patch. This v2 version also has the above
changes. It also includes the TAP test file, which is still in a WIP
state, mainly because I have yet to add the conflict recovery handling
scenarios.
The attached v3 patch includes a new scenario to test conflict recovery
handling, by verifying that the conflicting slot gets dropped.
With this, I am done with the test changes, except for the question below,
which I had posted earlier and on which I would like inputs:
Regarding the test result failures, I could see that when we drop a
logical replication slot on the standby server, the catalog_xmin of the
physical replication slot becomes NULL, whereas the test expects it to
be equal to xmin; that's the reason a couple of test scenarios are
failing:
ok 33 - slot on standby dropped manually
Waiting for replication conn replica's replay_lsn to pass '0/31273E0' on master
done
not ok 34 - physical catalog_xmin still non-null
not ok 35 - xmin and catalog_xmin equal after slot drop
# Failed test 'xmin and catalog_xmin equal after slot drop'
# at t/016_logical_decoding_on_replica.pl line 272.
# got:
# expected: 2584
I am not sure what is expected. What actually happens is: the
physical slot's catalog_xmin remains NULL initially, but becomes
non-NULL after the logical replication slot is created on the standby.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
logical-decoding-on-standby_v3.patch (application/octet-stream)
From a508f1c38ff689ec5a8d9df371fd941d547fa479 Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Wed, 3 Apr 2019 19:26:49 +0530
Subject: [PATCH] Logical decoding on standby.
-Andres Freund.
Besides the above main changes by Andres, the following changes were done by
Amit Khandekar:
1. Handle slot conflict recovery by dropping the conflicting slots.
2. test/recovery/t/016_logical_decoding_on_replica.pl added.
This test was originally written by Craig Ringer, with some changes
from Amit Khandekar. It is still in a WIP state; scenarios to test
conflict recovery are yet to be added.
Incremental changes in v3: added a conflict recovery handling scenario.
---
src/backend/access/gist/gistxlog.c | 6 +-
src/backend/access/hash/hash_xlog.c | 3 +-
src/backend/access/hash/hashinsert.c | 2 +
src/backend/access/heap/heapam.c | 23 +-
src/backend/access/heap/vacuumlazy.c | 2 +-
src/backend/access/heap/visibilitymap.c | 2 +-
src/backend/access/nbtree/nbtpage.c | 3 +
src/backend/access/nbtree/nbtxlog.c | 4 +-
src/backend/access/spgist/spgvacuum.c | 2 +
src/backend/access/spgist/spgxlog.c | 1 +
src/backend/replication/logical/logical.c | 2 +
src/backend/replication/slot.c | 79 +++++
src/backend/storage/ipc/standby.c | 7 +-
src/backend/utils/cache/lsyscache.c | 16 +
src/include/access/gistxlog.h | 3 +-
src/include/access/hash_xlog.h | 1 +
src/include/access/heapam_xlog.h | 8 +-
src/include/access/nbtxlog.h | 2 +
src/include/access/spgxlog.h | 1 +
src/include/replication/slot.h | 2 +
src/include/storage/standby.h | 2 +-
src/include/utils/lsyscache.h | 1 +
src/include/utils/rel.h | 1 +
.../recovery/t/016_logical_decoding_on_replica.pl | 358 +++++++++++++++++++++
24 files changed, 513 insertions(+), 18 deletions(-)
create mode 100644 src/test/recovery/t/016_logical_decoding_on_replica.pl
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 4fb1855..59a7910 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -342,7 +342,8 @@ gistRedoDeleteRecord(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
@@ -544,7 +545,7 @@ gistRedoPageReuse(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
- xlrec->node);
+ xlrec->onCatalogTable, xlrec->node);
}
}
@@ -736,6 +737,7 @@ gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXi
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index d7b7098..00c3e0f 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -1002,7 +1002,8 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
RelFileNode rnode;
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer);
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index e17f017..b67e4e6 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
#include "access/hash.h"
#include "access/hash_xlog.h"
+#include "catalog/catalog.h"
#include "miscadmin.h"
#include "utils/rel.h"
#include "storage/lwlock.h"
@@ -398,6 +399,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
xl_hash_vacuum_one_page xlrec;
XLogRecPtr recptr;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(hrel);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.ntuples = ndeletable;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a05b6a0..bfbb9d3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7100,12 +7100,13 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
* see comments for vacuum_log_cleanup_info().
*/
XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
+log_heap_cleanup_info(Relation rel, TransactionId latestRemovedXid)
{
xl_heap_cleanup_info xlrec;
XLogRecPtr recptr;
- xlrec.node = rnode;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
+ xlrec.node = rel->rd_node;
xlrec.latestRemovedXid = latestRemovedXid;
XLogBeginInsert();
@@ -7141,6 +7142,7 @@ log_heap_clean(Relation reln, Buffer buffer,
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
xlrec.ndead = ndead;
@@ -7191,6 +7193,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.cutoff_xid = cutoff_xid;
xlrec.ntuples = ntuples;
@@ -7221,7 +7224,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
* heap_buffer, if necessary.
*/
XLogRecPtr
-log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
+log_heap_visible(Relation rel, Buffer heap_buffer, Buffer vm_buffer,
TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
@@ -7231,6 +7234,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(heap_buffer));
Assert(BufferIsValid(vm_buffer));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec.cutoff_xid = cutoff_xid;
xlrec.flags = vmflags;
XLogBeginInsert();
@@ -7651,7 +7655,8 @@ heap_xlog_cleanup_info(XLogReaderState *record)
xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, xlrec->node);
/*
* Actual operation is a no-op. Record type exists to provide a means for
@@ -7687,7 +7692,8 @@ heap_xlog_clean(XLogReaderState *record)
* latestRemovedXid is invalid, skip conflict processing.
*/
if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
/*
* If we have a full-page image, restore it (using a cleanup lock) and
@@ -7783,7 +7789,9 @@ heap_xlog_visible(XLogReaderState *record)
* rather than killing the transaction outright.
*/
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid,
+ xlrec->onCatalogTable,
+ rnode);
/*
* Read the heap page, if it still exists. If the heap file has dropped or
@@ -7920,7 +7928,8 @@ heap_xlog_freeze_page(XLogReaderState *record)
TransactionIdRetreat(latestRemovedXid);
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index c9d8312..fad08e0 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -475,7 +475,7 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
* No need to write the record at all unless it contains a valid value
*/
if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
- (void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+ (void) log_heap_cleanup_info(rel, vacrelstats->latestRemovedXid);
}
/*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06..c5fdd64 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -281,7 +281,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
if (XLogRecPtrIsInvalid(recptr))
{
Assert(!InRecovery);
- recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
+ recptr = log_heap_visible(rel, heapBuf, vmBuf,
cutoff_xid, flags);
/*
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 8ade165..745cbc5 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -31,6 +31,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/lsyscache.h"
#include "utils/snapmgr.h"
static void _bt_cachemetadata(Relation rel, BTMetaPageData *input);
@@ -773,6 +774,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
@@ -1140,6 +1142,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
XLogRecPtr recptr;
xl_btree_delete xlrec_delete;
+ xlrec_delete.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_delete.latestRemovedXid = latestRemovedXid;
xlrec_delete.nitems = nitems;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 0a85d8b..2617d55 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -526,7 +526,8 @@ btree_xlog_delete(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
/*
@@ -810,6 +811,7 @@ btree_xlog_reuse_page(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable,
xlrec->node);
}
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index b9311ce..ef4910f 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -27,6 +27,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
+#include "utils/lsyscache.h"
/* Entry in pending-list of TIDs we need to revisit */
@@ -502,6 +503,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
OffsetNumber itemnos[MaxIndexTuplesPerPage];
spgxlogVacuumRedirect xlrec;
+ xlrec.onCatalogTable = get_rel_logical_catalog(index->rd_index->indrelid);
xlrec.nToPlaceholder = 0;
xlrec.newestRedirectXid = InvalidTransactionId;
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index ebe6ae8..800609c 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -881,6 +881,7 @@ spgRedoVacuumRedirect(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &node, NULL, NULL);
ResolveRecoveryConflictWithSnapshot(xldata->newestRedirectXid,
+ xldata->onCatalogTable,
node);
}
}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 6e5bc12..e8b7af4 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -94,6 +94,7 @@ CheckLogicalDecodingRequirements(void)
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires a database connection")));
+#ifdef NOT_ANYMORE
/* ----
* TODO: We got to change that someday soon...
*
@@ -111,6 +112,7 @@ CheckLogicalDecodingRequirements(void)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("logical decoding cannot be used while in recovery")));
+#endif
}
/*
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 006446b..5785d2f 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1064,6 +1064,85 @@ ReplicationSlotReserveWal(void)
}
}
+void
+ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
+{
+ int i;
+ bool found_conflict = false;
+
+ if (max_replication_slots <= 0)
+ return;
+
+restart:
+ if (found_conflict)
+ {
+ CHECK_FOR_INTERRUPTS();
+ /*
+ * Wait awhile for them to die so that we avoid flooding an
+ * unresponsive backend when system is heavily loaded.
+ */
+ pg_usleep(100000);
+ found_conflict = false;
+ }
+
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationSlot *s;
+ NameData slotname;
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;
+
+ s = &ReplicationSlotCtl->replication_slots[i];
+
+ /* cannot change while ReplicationSlotCtlLock is held */
+ if (!s->in_use)
+ continue;
+
+ /* not our database, skip */
+ if (s->data.database != InvalidOid && s->data.database != dboid)
+ continue;
+
+ SpinLockAcquire(&s->mutex);
+ slotname = s->data.name;
+ slot_xmin = s->data.xmin;
+ slot_catalog_xmin = s->data.catalog_xmin;
+ SpinLockRelease(&s->mutex);
+
+ if (TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(WARNING,
+ (errmsg("slot %s w/ xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_xmin, xid)));
+ }
+
+ if (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(WARNING,
+ (errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_catalog_xmin, xid)));
+ }
+
+
+ if (found_conflict)
+ {
+ elog(WARNING, "Dropping conflicting slot %s", s->data.name.data);
+ LWLockRelease(ReplicationSlotControlLock); /* avoid deadlock */
+ ReplicationSlotDropPtr(s);
+
+ /* We released the lock above; so re-scan the slots. */
+ goto restart;
+ }
+ }
+
+ LWLockRelease(ReplicationSlotControlLock);
+}
+
+
/*
* Flush all replication slots to disk.
*
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 215f146..75dbdb9 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -23,6 +23,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/slot.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -291,7 +292,8 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
}
void
-ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode node)
+ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
+ bool onCatalogTable, RelFileNode node)
{
VirtualTransactionId *backends;
@@ -312,6 +314,9 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode
ResolveRecoveryConflictWithVirtualXIDs(backends,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+
+ if (onCatalogTable)
+ ResolveRecoveryConflictWithSlots(node.dbNode, latestRemovedXid);
}
void
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index 1089556..92a6ed1 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -18,7 +18,9 @@
#include "access/hash.h"
#include "access/htup_details.h"
#include "access/nbtree.h"
+#include "access/table.h"
#include "bootstrap/bootstrap.h"
+#include "catalog/catalog.h"
#include "catalog/namespace.h"
#include "catalog/pg_am.h"
#include "catalog/pg_amop.h"
@@ -1896,6 +1898,20 @@ get_rel_persistence(Oid relid)
return result;
}
+bool
+get_rel_logical_catalog(Oid relid)
+{
+ bool res;
+ Relation rel;
+
+ /* assume previously locked */
+ rel = heap_open(relid, NoLock);
+ res = RelationIsAccessibleInLogicalDecoding(rel);
+ heap_close(rel, NoLock);
+
+ return res;
+}
+
/* ---------- TRANSFORM CACHE ---------- */
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 9990d97..887a377 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -47,10 +47,10 @@ typedef struct gistxlogPageUpdate
*/
typedef struct gistxlogDelete
{
+ bool onCatalogTable;
RelFileNode hnode; /* RelFileNode of the heap the index currently
* points at */
uint16 ntodelete; /* number of deleted offsets */
-
/*
* In payload of blk 0 : todelete OffsetNumbers
*/
@@ -95,6 +95,7 @@ typedef struct gistxlogPageDelete
*/
typedef struct gistxlogPageReuse
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 53b682c..fd70b55 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -263,6 +263,7 @@ typedef struct xl_hash_init_bitmap_page
*/
typedef struct xl_hash_vacuum_one_page
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int ntuples;
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 22cd13c..482c874 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -237,6 +237,7 @@ typedef struct xl_heap_update
*/
typedef struct xl_heap_clean
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 nredirected;
uint16 ndead;
@@ -252,6 +253,7 @@ typedef struct xl_heap_clean
*/
typedef struct xl_heap_cleanup_info
{
+ bool onCatalogTable;
RelFileNode node;
TransactionId latestRemovedXid;
} xl_heap_cleanup_info;
@@ -332,6 +334,7 @@ typedef struct xl_heap_freeze_tuple
*/
typedef struct xl_heap_freeze_page
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint16 ntuples;
} xl_heap_freeze_page;
@@ -346,6 +349,7 @@ typedef struct xl_heap_freeze_page
*/
typedef struct xl_heap_visible
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint8 flags;
} xl_heap_visible;
@@ -395,7 +399,7 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
extern const char *heap2_identify(uint8 info);
extern void heap_xlog_logical_rewrite(XLogReaderState *r);
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
+extern XLogRecPtr log_heap_cleanup_info(Relation rel,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
@@ -414,7 +418,7 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
bool *totally_frozen);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
-extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
+extern XLogRecPtr log_heap_visible(Relation rel, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc8..f64a33c 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -126,6 +126,7 @@ typedef struct xl_btree_split
*/
typedef struct xl_btree_delete
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int nitems;
@@ -139,6 +140,7 @@ typedef struct xl_btree_delete
*/
typedef struct xl_btree_reuse_page
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/spgxlog.h b/src/include/access/spgxlog.h
index ee8fc6f..d535441 100644
--- a/src/include/access/spgxlog.h
+++ b/src/include/access/spgxlog.h
@@ -237,6 +237,7 @@ typedef struct spgxlogVacuumRoot
typedef struct spgxlogVacuumRedirect
{
+ bool onCatalogTable;
uint16 nToPlaceholder; /* number of redirects to make placeholders */
OffsetNumber firstPlaceholder; /* first placeholder tuple to remove */
TransactionId newestRedirectXid; /* newest XID of removed redirects */
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index a8f1d66..4e0776a 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -205,4 +205,6 @@ extern void CheckPointReplicationSlots(void);
extern void CheckSlotRequirements(void);
+extern void ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid);
+
#endif /* SLOT_H */
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 2361243..f276c7e 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -28,7 +28,7 @@ extern void InitRecoveryTransactionEnvironment(void);
extern void ShutdownRecoveryTransactionEnvironment(void);
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
- RelFileNode node);
+ bool catalogTable, RelFileNode node);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index 9606d02..78bc639 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -131,6 +131,7 @@ extern char get_rel_relkind(Oid relid);
extern bool get_rel_relispartition(Oid relid);
extern Oid get_rel_tablespace(Oid relid);
extern char get_rel_persistence(Oid relid);
+extern bool get_rel_logical_catalog(Oid relid);
extern Oid get_transform_fromsql(Oid typid, Oid langid, List *trftypes);
extern Oid get_transform_tosql(Oid typid, Oid langid, List *trftypes);
extern bool get_typisdefined(Oid typid);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 89a7fbf..c36e228 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -16,6 +16,7 @@
#include "access/tupdesc.h"
#include "access/xlog.h"
+#include "catalog/catalog.h"
#include "catalog/pg_class.h"
#include "catalog/pg_index.h"
#include "catalog/pg_publication.h"
diff --git a/src/test/recovery/t/016_logical_decoding_on_replica.pl b/src/test/recovery/t/016_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..7998d85
--- /dev/null
+++ b/src/test/recovery/t/016_logical_decoding_on_replica.pl
@@ -0,0 +1,386 @@
+# Demonstrate that logical decoding can follow timeline switches.
+#
+# Test logical decoding on a standby.
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 55;
+use RecursiveCopy;
+use File::Copy;
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--slot=decoding_standby');
+
+sub print_phys_xmin
+{
+ my $slot = $node_master->slot('decoding_standby');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# After slot creation, xmins must be null
+is($xmin, '', "xmin null");
+is($catalog_xmin, '', "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+ $node_master, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_replica->append_conf('postgresql.conf',
+ q[primary_slot_name = 'decoding_standby']);
+
+$node_replica->start;
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# with hot_standby_feedback off, xmin and catalog_xmin must still be null
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "xmin null after replica join");
+is($catalog_xmin, '', "catalog_xmin null after replica join");
+
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+sleep(2); # ensure walreceiver feedback sent
+
+# If no slot on standby exists to hold down catalog_xmin it must follow xmin,
+# (which is nextXid when no xacts are running on the standby).
+($xmin, $catalog_xmin) = print_phys_xmin();
+ok($xmin, "xmin not null");
+is($xmin, $catalog_xmin, "xmin and catalog_xmin equal");
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+#
+# This must succeed since we know we have a catalog_xmin reservation. We
+# might've already sent hot standby feedback to advance our physical slot's
+# catalog_xmin but not received the corresponding xlog for the catalog xmin
+# advance, in which case we'll create a slot that isn't usable. The calling
+# application can prevent this by creating a temporary slot on the master to
+# lock in its catalog_xmin. For a truly race-free solution we'd need
+# master-to-standby hot_standby_feedback replies.
+#
+# In this case it won't race because there's no concurrent activity on the
+# master.
+#
+is($node_replica->psql('testdb', qq[SELECT * FROM pg_create_logical_replication_slot('standby_logical', 'test_decoding')]),
+ 0, 'logical slot creation on standby succeeded')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+sub print_logical_xmin
+{
+ my $slot = $node_replica->slot('standby_logical');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
+$node_master->safe_psql('testdb', 'DROP TABLE test_table');
+$node_master->safe_psql('testdb', 'VACUUM');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+# Should show the inserts even when the table is dropped on master
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($stderr, '', 'stderr is empty');
+is($ret, 0, 'replay from slot succeeded')
+ or BAIL_OUT('cannot continue if slot replay fails');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_1();]);
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+for my $i (0 .. 2000)
+{
+ $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_2();]);
+$node_master->safe_psql('testdb', 'VACUUM');
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+cmp_ok($new_logical_catalog_xmin, "==", $logical_catalog_xmin, "logical slot catalog_xmin hasn't advanced before get_changes");
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+isnt($stdout, '', 'replayed some rows');
+
+($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+is($new_logical_xmin, '', "logical xmin null");
+isnt($new_logical_catalog_xmin, '', "logical slot catalog_xmin not null");
+cmp_ok($new_logical_catalog_xmin, ">", $logical_catalog_xmin, "logical slot catalog_xmin advanced after get_changes");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+isnt($new_physical_xmin, '', "physical xmin not null");
+# hot standby feedback should advance phys catalog_xmin now the standby's slot
+# doesn't hold it down as far.
+isnt($new_physical_catalog_xmin, '', "physical catalog_xmin not null");
+cmp_ok($new_physical_catalog_xmin, ">", $physical_catalog_xmin, "physical catalog_xmin advanced");
+
+cmp_ok($new_physical_catalog_xmin, "<=", $new_logical_catalog_xmin, 'upstream physical slot catalog_xmin not past downstream catalog_xmin with hs_feedback on');
+
+#########################################################
+# Upstream catalog retention
+#########################################################
+
+sub test_catalog_xmin_retention()
+{
+ # First burn some xids on the master in another DB, so we push the master's
+ # nextXid ahead.
+ foreach my $i (1 .. 100)
+ {
+ $node_master->safe_psql('postgres', 'SELECT txid_current()');
+ }
+
+ # Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
+ # past our needed xmin. The only way we have visibility into that is to force
+ # a checkpoint.
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
+ foreach my $dbname ('template1', 'postgres', 'testdb', 'template0')
+ {
+ $node_master->safe_psql($dbname, 'VACUUM FREEZE');
+ }
+ sleep(1);
+ $node_master->safe_psql('postgres', 'CHECKPOINT');
+ IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
+ or die "pg_controldata failed with $?";
+ my @checkpoint = split('\n', $stdout);
+ my ($oldestXid, $oldestCatalogXmin, $nextXid) = ('', '', '');
+ foreach my $line (@checkpoint)
+ {
+ if ($line =~ qr/^Latest checkpoint's NextXID:\s+\d+:(\d+)/)
+ {
+ $nextXid = $1;
+ }
+ if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
+ {
+ $oldestXid = $1;
+ }
+ if ($line =~ qr/^Latest checkpoint's oldestCatalogXmin:\s*(\d+)/)
+ {
+ $oldestCatalogXmin = $1;
+ }
+ }
+ die 'no oldestXID found in checkpoint' unless $oldestXid;
+
+ my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+ my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+
+ print "upstream oldestXid $oldestXid, oldestCatalogXmin $oldestCatalogXmin, nextXid $nextXid, phys slot catalog_xmin $new_physical_catalog_xmin, downstream catalog_xmin $new_logical_catalog_xmin";
+
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
+
+ return ($oldestXid, $oldestCatalogXmin);
+}
+
+my ($oldestXid, $oldestCatalogXmin) = test_catalog_xmin_retention();
+
+cmp_ok($oldestXid, "<=", $new_logical_catalog_xmin, 'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
+
+########################################################################
+# Recovery conflict: conflicting replication slot should get dropped
+########################################################################
+#
+#
+# One way to reproduce recovery conflict is to run VACUUM FULL with
+# hot_standby_feedback turned off on slave.
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = off
+]);
+$node_replica->restart;
+sleep(2); # ensure walreceiver feedback sent
+$node_master->safe_psql('testdb', 'VACUUM FULL');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+isnt($ret, 0, 'usage of slot failed as expected');
+like($stderr, qr/does not exist/, 'slot not found as expected');
+
+# Re-create the slot now that we know it is dropped
+is($node_replica->psql('testdb', qq[SELECT * FROM pg_create_logical_replication_slot('standby_logical', 'test_decoding')]),
+ 0, 'logical slot creation on standby succeeded')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+# Set hot_standby_feedback back on
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+sleep(2); # ensure walreceiver feedback sent
+
+##################################################
+# Drop slot
+##################################################
+#
+is($node_replica->safe_psql('postgres', 'SHOW hot_standby_feedback'), 'on', 'hs_feedback is on');
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+
+# Make sure slots on replicas are droppable, and properly clear the upstream's xmin
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_xmin, $new_catalog_xmin) = print_phys_xmin();
+# We're now back to the old behaviour of hot_standby_feedback
+# reporting nextXid for both thresholds
+ok($new_catalog_xmin, "physical catalog_xmin still non-null");
+cmp_ok($new_catalog_xmin, '==', $new_xmin,
+ 'xmin and catalog_xmin equal after slot drop');
+
+
+##################################################
+# Recovery: drop database drops idle slots
+##################################################
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB on the upstream if they're on the right DB, or not dropped if on
+# another DB.
+
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-P', 'test_decoding', '-S', 'dodropslot', '--create-slot'], 'pg_recvlogical created dodropslot');
+# or BAIL_OUT('slot creation failed, subsequent results would be meaningless');
+# TODO : Above, it bails out even when pg_recvlogical is successful, commented out BAIL_OUT
+$node_replica->command_ok(['pg_recvlogical', '-v', '-d', $node_replica->connstr('postgres'), '-P', 'test_decoding', '-S', 'otherslot', '--create-slot'], 'pg_recvlogical created otherslot');
+# or BAIL_OUT('slot creation failed, subsequent results would be meaningless');
+# TODO : Above, it bails out even when pg_recvlogical is successful, commented out BAIL_OUT
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, 'logical', 'slot dodropslot on standby created');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'slot otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+ 'database dropped on standby');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, '', 'slot dodropslot on standby dropped');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby not dropped');
+
+
+##################################################
+# Recovery: drop database drops in-use slots
+##################################################
+
+# This time, have the slot in-use on the downstream DB when we drop it.
+print "Testing dropdb when downstream slot is in-use";
+$node_master->psql('postgres', q[CREATE DATABASE testdb2]);
+
+print "creating slot dodropslot2";
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-P', 'test_decoding', '-S', 'dodropslot2', '--create-slot'],
+ 'pg_recvlogical created slot test_decoding');
+is($node_replica->slot('dodropslot2')->{'slot_type'}, 'logical', 'slot dodropslot2 on standby created');
+
+# make sure the slot is in use
+print "starting pg_recvlogical";
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+sleep(1);
+
+is($node_replica->slot('dodropslot2')->{'active'}, 't', 'slot on standby is active')
+ or BAIL_OUT("slot not active on standby, cannot continue. pg_recvlogical exited with '$stdout', '$stderr'");
+
+# Master doesn't know the replica's slot is busy so dropdb should succeed
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb2]);
+ok(1, 'dropdb finished');
+
+while ($node_replica->slot('dodropslot2')->{'active_pid'})
+{
+ sleep(1);
+ print "waiting for walsender to exit";
+}
+
+print "walsender exited, waiting for pg_recvlogical to exit";
+
+# our client should've terminated in response to the walsender error
+eval {
+ $handle->finish;
+};
+$return = $?;
+if ($return) {
+ is($return, 256, "pg_recvlogical terminated by server");
+ like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict');
+ like($stderr, qr/User was connected to a database that must be dropped./, 'recvlogical recovery conflict db');
+}
+
+is($node_replica->slot('dodropslot2')->{'active_pid'}, '', 'walsender backend exited');
+
+# The slot should be dropped by recovery now
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb2')]), 'f',
+ 'database dropped on standby');
+
+is($node_replica->slot('dodropslot2')->{'slot_type'}, '', 'slot on standby dropped');
Hi,
Thanks for the new version of the patch. Btw, could you add Craig as a
co-author in the commit message of the next version of the patch? Don't
want to forget him.
On 2019-04-05 17:08:39 +0530, Amit Khandekar wrote:
Regarding the test result failures, I could see that when we drop a
logical replication slot at standby server, then the catalog_xmin of
physical replication slot becomes NULL, whereas the test expects it to
be equal to xmin; and that's the reason a couple of test scenarios are
failing :
ok 33 - slot on standby dropped manually
Waiting for replication conn replica's replay_lsn to pass '0/31273E0' on master
done
not ok 34 - physical catalog_xmin still non-null
not ok 35 - xmin and catalog_xmin equal after slot drop
# Failed test 'xmin and catalog_xmin equal after slot drop'
# at t/016_logical_decoding_on_replica.pl line 272.
# got:
# expected: 2584
I am not sure what is expected. What actually happens is : the
physical slot catalog_xmin remains NULL initially, but becomes
non-NULL after the logical replication slot is created on standby.
That seems like the correct behaviour to me - why would we still have a
catalog xmin if there's no logical slot?
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 006446b..5785d2f 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1064,6 +1064,85 @@ ReplicationSlotReserveWal(void)
}
}
+void
+ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
+{
+ int i;
+ bool found_conflict = false;
+
+ if (max_replication_slots <= 0)
+ return;
+
+restart:
+ if (found_conflict)
+ {
+ CHECK_FOR_INTERRUPTS();
+ /*
+ * Wait awhile for them to die so that we avoid flooding an
+ * unresponsive backend when system is heavily loaded.
+ */
+ pg_usleep(100000);
+ found_conflict = false;
+ }
+
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationSlot *s;
+ NameData slotname;
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;
+
+ s = &ReplicationSlotCtl->replication_slots[i];
+
+ /* cannot change while ReplicationSlotCtlLock is held */
+ if (!s->in_use)
+ continue;
+
+ /* not our database, skip */
+ if (s->data.database != InvalidOid && s->data.database != dboid)
+ continue;
+
+ SpinLockAcquire(&s->mutex);
+ slotname = s->data.name;
+ slot_xmin = s->data.xmin;
+ slot_catalog_xmin = s->data.catalog_xmin;
+ SpinLockRelease(&s->mutex);
+
+ if (TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(WARNING,
+ (errmsg("slot %s w/ xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_xmin, xid)));
+ }
+
+ if (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(WARNING,
+ (errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_catalog_xmin, xid)));
+ }
+
+
+ if (found_conflict)
+ {
+ elog(WARNING, "Dropping conflicting slot %s", s->data.name.data);
+ LWLockRelease(ReplicationSlotControlLock); /* avoid deadlock */
+ ReplicationSlotDropPtr(s);
+
+ /* We released the lock above; so re-scan the slots. */
+ goto restart;
+ }
+ }
I think this should be refactored so that the two found_conflict cases
set a 'reason' variable (perhaps an enum?) to the particular reason, and
then only one warning should be emitted. I also think that LOG might be
more appropriate than WARNING - as confusing as that is, LOG is more
severe than WARNING (see docs about log_min_messages).
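Roughly the shape I have in mind - just a sketch, the enum name and message
wording below are illustrative only, not taken from the patch:

    typedef enum SlotConflictReason
    {
        SLOT_CONFLICT_NONE,
        SLOT_CONFLICT_XMIN,
        SLOT_CONFLICT_CATALOG_XMIN
    } SlotConflictReason;

    /* inside the per-slot loop, replacing the two WARNING blocks */
    SlotConflictReason reason = SLOT_CONFLICT_NONE;

    if (TransactionIdIsValid(slot_xmin) &&
        TransactionIdPrecedesOrEquals(slot_xmin, xid))
        reason = SLOT_CONFLICT_XMIN;
    else if (TransactionIdIsValid(slot_catalog_xmin) &&
             TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
        reason = SLOT_CONFLICT_CATALOG_XMIN;

    if (reason != SLOT_CONFLICT_NONE)
    {
        /* one message per conflicting slot, and at LOG rather than WARNING */
        ereport(LOG,
                (errmsg("dropping conflicting replication slot \"%s\"",
                        NameStr(slotname)),
                 errdetail("Slot's %s %u precedes removed xid %u.",
                           reason == SLOT_CONFLICT_XMIN ? "xmin" : "catalog_xmin",
                           reason == SLOT_CONFLICT_XMIN ? slot_xmin : slot_catalog_xmin,
                           xid)));
        found_conflict = true;
        /* then release the lock, drop the slot and restart the scan as before */
    }

With an else-if like that only one reason gets reported per slot; whether
both thresholds ever need to be mentioned at once is a separate question.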
@@ -0,0 +1,386 @@
+# Demonstrate that logical can follow timeline switches.
+#
+# Test logical decoding on a standby.
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 55;
+use RecursiveCopy;
+use File::Copy;
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--slot=decoding_standby');
+
+sub print_phys_xmin
+{
+ my $slot = $node_master->slot('decoding_standby');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# After slot creation, xmins must be null
+is($xmin, '', "xmin null");
+is($catalog_xmin, '', "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+ $node_master, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_replica->append_conf('postgresql.conf',
+ q[primary_slot_name = 'decoding_standby']);
+
+$node_replica->start;
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# with hot_standby_feedback off, xmin and catalog_xmin must still be null
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "xmin null after replica join");
+is($catalog_xmin, '', "catalog_xmin null after replica join");
+
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+sleep(2); # ensure walreceiver feedback sent
Can we make this more robust? E.g. by waiting till pg_stat_replication
shows the change on the primary? Because I can guarantee that this'll
fail on slow buildfarm machines (say the valgrind animals).
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
Similar.
Greetings,
Andres Freund
On Sat, 6 Apr 2019 at 04:45, Andres Freund <andres@anarazel.de> wrote:
Hi,
Thanks for the new version of the patch. Btw, could you add Craig as a
co-author in the commit message of the next version of the patch? Don't
want to forget him.
I had put his name in the earlier patch. But now I have made it easier to spot.
On 2019-04-05 17:08:39 +0530, Amit Khandekar wrote:
Regarding the test result failures, I could see that when we drop a
logical replication slot at standby server, then the catalog_xmin of
physical replication slot becomes NULL, whereas the test expects it to
be equal to xmin; and that's the reason a couple of test scenarios are
failing :
ok 33 - slot on standby dropped manually
Waiting for replication conn replica's replay_lsn to pass '0/31273E0' on master
done
not ok 34 - physical catalog_xmin still non-null
not ok 35 - xmin and catalog_xmin equal after slot drop
# Failed test 'xmin and catalog_xmin equal after slot drop'
# at t/016_logical_decoding_on_replica.pl line 272.
# got:
# expected: 2584
I am not sure what is expected. What actually happens is : the
physical slot catalog_xmin remains NULL initially, but becomes
non-NULL after the logical replication slot is created on standby.
That seems like the correct behaviour to me - why would we still have a
catalog xmin if there's no logical slot?
Yeah ... In the earlier implementation, maybe it was different, that's
why the catalog_xmin didn't become NULL. Not sure. Anyways, I have
changed this check. Details in the following sections.
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 006446b..5785d2f 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1064,6 +1064,85 @@ ReplicationSlotReserveWal(void)
}
}
+void
+ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
+{
+ int i;
+ bool found_conflict = false;
+
+ if (max_replication_slots <= 0)
+ return;
+
+restart:
+ if (found_conflict)
+ {
+ CHECK_FOR_INTERRUPTS();
+ /*
+ * Wait awhile for them to die so that we avoid flooding an
+ * unresponsive backend when system is heavily loaded.
+ */
+ pg_usleep(100000);
+ found_conflict = false;
+ }
+
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationSlot *s;
+ NameData slotname;
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;
+
+ s = &ReplicationSlotCtl->replication_slots[i];
+
+ /* cannot change while ReplicationSlotCtlLock is held */
+ if (!s->in_use)
+ continue;
+
+ /* not our database, skip */
+ if (s->data.database != InvalidOid && s->data.database != dboid)
+ continue;
+
+ SpinLockAcquire(&s->mutex);
+ slotname = s->data.name;
+ slot_xmin = s->data.xmin;
+ slot_catalog_xmin = s->data.catalog_xmin;
+ SpinLockRelease(&s->mutex);
+
+ if (TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(WARNING,
+ (errmsg("slot %s w/ xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_xmin, xid)));
+ }
+
+ if (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(WARNING,
+ (errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_catalog_xmin, xid)));
+ }
+
+
+ if (found_conflict)
+ {
+ elog(WARNING, "Dropping conflicting slot %s", s->data.name.data);
+ LWLockRelease(ReplicationSlotControlLock); /* avoid deadlock */
+ ReplicationSlotDropPtr(s);
+
+ /* We released the lock above; so re-scan the slots. */
+ goto restart;
+ }
+ }
I think this should be refactored so that the two found_conflict cases
set a 'reason' variable (perhaps an enum?) to the particular reason, and
then only one warning should be emitted. I also think that LOG might be
more appropriate than WARNING - as confusing as that is, LOG is more
severe than WARNING (see docs about log_min_messages).
What I have in mind is :
ereport(LOG,
(errcode(ERRCODE_INTERNAL_ERROR),
errmsg("Dropping conflicting slot %s", s->data.name.data),
errdetail("%s, removed xid %d.", conflict_str, xid)));
where conflict_str is a dynamically generated string containing
something like : "slot xmin : 1234, slot catalog_xmin: 5678"
So for the user, the errdetail will look like :
"slot xmin: 1234, catalog_xmin: 5678, removed xid : 9012"
I think the user can figure out whether it was xmin or catalog_xmin or
both that conflicted with removed xid.
If we don't do it this way, we may not be able to show in a single
message that both xmin and catalog_xmin are conflicting at the same
time.
Does this message look good to you, or did you have in mind something
quite different?
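A minimal sketch of what I mean, assuming the StringInfo helpers from
lib/stringinfo.h (the exact wording is illustrative, not final):

    StringInfoData conflict_str;

    initStringInfo(&conflict_str);

    if (TransactionIdIsValid(slot_xmin) &&
        TransactionIdPrecedesOrEquals(slot_xmin, xid))
        appendStringInfo(&conflict_str, "slot xmin: %u, ", slot_xmin);

    if (TransactionIdIsValid(slot_catalog_xmin) &&
        TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
        appendStringInfo(&conflict_str, "slot catalog_xmin: %u, ", slot_catalog_xmin);

    if (conflict_str.len > 0)
    {
        ereport(LOG,
                (errcode(ERRCODE_INTERNAL_ERROR),
                 errmsg("Dropping conflicting slot %s", NameStr(slotname)),
                 errdetail("%sremoved xid %u.", conflict_str.data, xid)));
        /* release the lock, drop the slot and restart the scan, as the patch does */
    }

    pfree(conflict_str.data);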
@@ -0,0 +1,386 @@
+# Demonstrate that logical can follow timeline switches.
+#
+# Test logical decoding on a standby.
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 55;
+use RecursiveCopy;
+use File::Copy;
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--slot=decoding_standby');
+
+sub print_phys_xmin
+{
+ my $slot = $node_master->slot('decoding_standby');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# After slot creation, xmins must be null
+is($xmin, '', "xmin null");
+is($catalog_xmin, '', "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+ $node_master, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_replica->append_conf('postgresql.conf',
+ q[primary_slot_name = 'decoding_standby']);
+
+$node_replica->start;
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# with hot_standby_feedback off, xmin and catalog_xmin must still be null
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "xmin null after replica join");
+is($catalog_xmin, '', "catalog_xmin null after replica join");
+
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+sleep(2); # ensure walreceiver feedback sent
Can we make this more robust? E.g. by waiting till pg_stat_replication
shows the change on the primary? Because I can guarantee that this'll
fail on slow buildfarm machines (say the valgrind animals).
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
Similar.
Ok. I have put a copy of the get_slot_xmins() function from
t/001_stream_rep.pl() into 016_logical_decoding_on_replica.pl. Renamed
it to wait_for_phys_mins(). And used this to wait for the
hot_standby_feedback change to propagate to master. This function
waits for the physical slot's xmin and catalog_xmin to get the right
values depending on whether there is a logical slot in standby and
whether hot_standby_feedback is on on standby.
I was not sure how pg_stat_replication could be used to detect that the
hot_standby_feedback change has reached the master. So I did it the above
way, which I think pretty much does what we want.
Attached v4 patch only has the testcase change, and some minor cleanup
in the test file.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
logical-decoding-on-standby_v4.patch (application/x-patch)
From 1e3c68a644da4aa45ca72190cfa254ccd171f9e3 Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Tue, 9 Apr 2019 22:06:25 +0530
Subject: [PATCH] Logical decoding on standby.
Author: Andres Freund.
Besides the above main changes, the patch includes the following:
1. Handle slot conflict recovery by dropping the conflicting slots.
   - Amit Khandekar.
2. test/recovery/t/016_logical_decoding_on_replica.pl added.
   Original author: Craig Ringer. Few changes/additions from Amit Khandekar.
---
src/backend/access/gist/gistxlog.c | 6 +-
src/backend/access/hash/hash_xlog.c | 3 +-
src/backend/access/hash/hashinsert.c | 2 +
src/backend/access/heap/heapam.c | 23 +-
src/backend/access/heap/vacuumlazy.c | 2 +-
src/backend/access/heap/visibilitymap.c | 2 +-
src/backend/access/nbtree/nbtpage.c | 3 +
src/backend/access/nbtree/nbtxlog.c | 4 +-
src/backend/access/spgist/spgvacuum.c | 2 +
src/backend/access/spgist/spgxlog.c | 1 +
src/backend/replication/logical/logical.c | 2 +
src/backend/replication/slot.c | 79 +++++
src/backend/storage/ipc/standby.c | 7 +-
src/backend/utils/cache/lsyscache.c | 16 +
src/include/access/gistxlog.h | 3 +-
src/include/access/hash_xlog.h | 1 +
src/include/access/heapam_xlog.h | 8 +-
src/include/access/nbtxlog.h | 2 +
src/include/access/spgxlog.h | 1 +
src/include/replication/slot.h | 2 +
src/include/storage/standby.h | 2 +-
src/include/utils/lsyscache.h | 1 +
src/include/utils/rel.h | 1 +
.../recovery/t/016_logical_decoding_on_replica.pl | 391 +++++++++++++++++++++
24 files changed, 546 insertions(+), 18 deletions(-)
create mode 100644 src/test/recovery/t/016_logical_decoding_on_replica.pl
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 4fb1855..59a7910 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -342,7 +342,8 @@ gistRedoDeleteRecord(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
@@ -544,7 +545,7 @@ gistRedoPageReuse(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
- xlrec->node);
+ xlrec->onCatalogTable, xlrec->node);
}
}
@@ -736,6 +737,7 @@ gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXi
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index d7b7098..00c3e0f 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -1002,7 +1002,8 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
RelFileNode rnode;
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer);
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index e17f017..b67e4e6 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
#include "access/hash.h"
#include "access/hash_xlog.h"
+#include "catalog/catalog.h"
#include "miscadmin.h"
#include "utils/rel.h"
#include "storage/lwlock.h"
@@ -398,6 +399,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
xl_hash_vacuum_one_page xlrec;
XLogRecPtr recptr;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(hrel);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.ntuples = ndeletable;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a05b6a0..bfbb9d3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7100,12 +7100,13 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
* see comments for vacuum_log_cleanup_info().
*/
XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
+log_heap_cleanup_info(Relation rel, TransactionId latestRemovedXid)
{
xl_heap_cleanup_info xlrec;
XLogRecPtr recptr;
- xlrec.node = rnode;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
+ xlrec.node = rel->rd_node;
xlrec.latestRemovedXid = latestRemovedXid;
XLogBeginInsert();
@@ -7141,6 +7142,7 @@ log_heap_clean(Relation reln, Buffer buffer,
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
xlrec.ndead = ndead;
@@ -7191,6 +7193,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.cutoff_xid = cutoff_xid;
xlrec.ntuples = ntuples;
@@ -7221,7 +7224,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
* heap_buffer, if necessary.
*/
XLogRecPtr
-log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
+log_heap_visible(Relation rel, Buffer heap_buffer, Buffer vm_buffer,
TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
@@ -7231,6 +7234,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(heap_buffer));
Assert(BufferIsValid(vm_buffer));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec.cutoff_xid = cutoff_xid;
xlrec.flags = vmflags;
XLogBeginInsert();
@@ -7651,7 +7655,8 @@ heap_xlog_cleanup_info(XLogReaderState *record)
xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, xlrec->node);
/*
* Actual operation is a no-op. Record type exists to provide a means for
@@ -7687,7 +7692,8 @@ heap_xlog_clean(XLogReaderState *record)
* latestRemovedXid is invalid, skip conflict processing.
*/
if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
/*
* If we have a full-page image, restore it (using a cleanup lock) and
@@ -7783,7 +7789,9 @@ heap_xlog_visible(XLogReaderState *record)
* rather than killing the transaction outright.
*/
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid,
+ xlrec->onCatalogTable,
+ rnode);
/*
* Read the heap page, if it still exists. If the heap file has dropped or
@@ -7920,7 +7928,8 @@ heap_xlog_freeze_page(XLogReaderState *record)
TransactionIdRetreat(latestRemovedXid);
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index c9d8312..fad08e0 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -475,7 +475,7 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
* No need to write the record at all unless it contains a valid value
*/
if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
- (void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+ (void) log_heap_cleanup_info(rel, vacrelstats->latestRemovedXid);
}
/*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06..c5fdd64 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -281,7 +281,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
if (XLogRecPtrIsInvalid(recptr))
{
Assert(!InRecovery);
- recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
+ recptr = log_heap_visible(rel, heapBuf, vmBuf,
cutoff_xid, flags);
/*
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 8ade165..745cbc5 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -31,6 +31,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/lsyscache.h"
#include "utils/snapmgr.h"
static void _bt_cachemetadata(Relation rel, BTMetaPageData *input);
@@ -773,6 +774,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
@@ -1140,6 +1142,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
XLogRecPtr recptr;
xl_btree_delete xlrec_delete;
+ xlrec_delete.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_delete.latestRemovedXid = latestRemovedXid;
xlrec_delete.nitems = nitems;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 0a85d8b..2617d55 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -526,7 +526,8 @@ btree_xlog_delete(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
/*
@@ -810,6 +811,7 @@ btree_xlog_reuse_page(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable,
xlrec->node);
}
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index b9311ce..ef4910f 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -27,6 +27,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
+#include "utils/lsyscache.h"
/* Entry in pending-list of TIDs we need to revisit */
@@ -502,6 +503,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
OffsetNumber itemnos[MaxIndexTuplesPerPage];
spgxlogVacuumRedirect xlrec;
+ xlrec.onCatalogTable = get_rel_logical_catalog(index->rd_index->indrelid);
xlrec.nToPlaceholder = 0;
xlrec.newestRedirectXid = InvalidTransactionId;
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index ebe6ae8..800609c 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -881,6 +881,7 @@ spgRedoVacuumRedirect(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &node, NULL, NULL);
ResolveRecoveryConflictWithSnapshot(xldata->newestRedirectXid,
+ xldata->onCatalogTable,
node);
}
}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 6e5bc12..e8b7af4 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -94,6 +94,7 @@ CheckLogicalDecodingRequirements(void)
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires a database connection")));
+#ifdef NOT_ANYMORE
/* ----
* TODO: We got to change that someday soon...
*
@@ -111,6 +112,7 @@ CheckLogicalDecodingRequirements(void)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("logical decoding cannot be used while in recovery")));
+#endif
}
/*
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 006446b..5785d2f 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1064,6 +1064,85 @@ ReplicationSlotReserveWal(void)
}
}
+void
+ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
+{
+ int i;
+ bool found_conflict = false;
+
+ if (max_replication_slots <= 0)
+ return;
+
+restart:
+ if (found_conflict)
+ {
+ CHECK_FOR_INTERRUPTS();
+ /*
+ * Wait awhile for them to die so that we avoid flooding an
+ * unresponsive backend when system is heavily loaded.
+ */
+ pg_usleep(100000);
+ found_conflict = false;
+ }
+
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationSlot *s;
+ NameData slotname;
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;
+
+ s = &ReplicationSlotCtl->replication_slots[i];
+
+ /* cannot change while ReplicationSlotCtlLock is held */
+ if (!s->in_use)
+ continue;
+
+ /* not our database, skip */
+ if (s->data.database != InvalidOid && s->data.database != dboid)
+ continue;
+
+ SpinLockAcquire(&s->mutex);
+ slotname = s->data.name;
+ slot_xmin = s->data.xmin;
+ slot_catalog_xmin = s->data.catalog_xmin;
+ SpinLockRelease(&s->mutex);
+
+ if (TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(WARNING,
+ (errmsg("slot %s w/ xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_xmin, xid)));
+ }
+
+ if (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(WARNING,
+ (errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_catalog_xmin, xid)));
+ }
+
+
+ if (found_conflict)
+ {
+ elog(WARNING, "Dropping conflicting slot %s", s->data.name.data);
+ LWLockRelease(ReplicationSlotControlLock); /* avoid deadlock */
+ ReplicationSlotDropPtr(s);
+
+ /* We released the lock above; so re-scan the slots. */
+ goto restart;
+ }
+ }
+
+ LWLockRelease(ReplicationSlotControlLock);
+}
+
+
/*
* Flush all replication slots to disk.
*
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 215f146..75dbdb9 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -23,6 +23,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/slot.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -291,7 +292,8 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
}
void
-ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode node)
+ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
+ bool onCatalogTable, RelFileNode node)
{
VirtualTransactionId *backends;
@@ -312,6 +314,9 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode
ResolveRecoveryConflictWithVirtualXIDs(backends,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+
+ if (onCatalogTable)
+ ResolveRecoveryConflictWithSlots(node.dbNode, latestRemovedXid);
}
void
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index b4f2d0f..f4da4bc 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -18,7 +18,9 @@
#include "access/hash.h"
#include "access/htup_details.h"
#include "access/nbtree.h"
+#include "access/table.h"
#include "bootstrap/bootstrap.h"
+#include "catalog/catalog.h"
#include "catalog/namespace.h"
#include "catalog/pg_am.h"
#include "catalog/pg_amop.h"
@@ -1893,6 +1895,20 @@ get_rel_persistence(Oid relid)
return result;
}
+bool
+get_rel_logical_catalog(Oid relid)
+{
+ bool res;
+ Relation rel;
+
+ /* assume previously locked */
+ rel = heap_open(relid, NoLock);
+ res = RelationIsAccessibleInLogicalDecoding(rel);
+ heap_close(rel, NoLock);
+
+ return res;
+}
+
/* ---------- TRANSFORM CACHE ---------- */
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 9990d97..887a377 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -47,10 +47,10 @@ typedef struct gistxlogPageUpdate
*/
typedef struct gistxlogDelete
{
+ bool onCatalogTable;
RelFileNode hnode; /* RelFileNode of the heap the index currently
* points at */
uint16 ntodelete; /* number of deleted offsets */
-
/*
* In payload of blk 0 : todelete OffsetNumbers
*/
@@ -95,6 +95,7 @@ typedef struct gistxlogPageDelete
*/
typedef struct gistxlogPageReuse
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 53b682c..fd70b55 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -263,6 +263,7 @@ typedef struct xl_hash_init_bitmap_page
*/
typedef struct xl_hash_vacuum_one_page
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int ntuples;
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 22cd13c..482c874 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -237,6 +237,7 @@ typedef struct xl_heap_update
*/
typedef struct xl_heap_clean
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 nredirected;
uint16 ndead;
@@ -252,6 +253,7 @@ typedef struct xl_heap_clean
*/
typedef struct xl_heap_cleanup_info
{
+ bool onCatalogTable;
RelFileNode node;
TransactionId latestRemovedXid;
} xl_heap_cleanup_info;
@@ -332,6 +334,7 @@ typedef struct xl_heap_freeze_tuple
*/
typedef struct xl_heap_freeze_page
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint16 ntuples;
} xl_heap_freeze_page;
@@ -346,6 +349,7 @@ typedef struct xl_heap_freeze_page
*/
typedef struct xl_heap_visible
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint8 flags;
} xl_heap_visible;
@@ -395,7 +399,7 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
extern const char *heap2_identify(uint8 info);
extern void heap_xlog_logical_rewrite(XLogReaderState *r);
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
+extern XLogRecPtr log_heap_cleanup_info(Relation rel,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
@@ -414,7 +418,7 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
bool *totally_frozen);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
-extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
+extern XLogRecPtr log_heap_visible(Relation rel, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc8..f64a33c 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -126,6 +126,7 @@ typedef struct xl_btree_split
*/
typedef struct xl_btree_delete
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int nitems;
@@ -139,6 +140,7 @@ typedef struct xl_btree_delete
*/
typedef struct xl_btree_reuse_page
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/spgxlog.h b/src/include/access/spgxlog.h
index ee8fc6f..d535441 100644
--- a/src/include/access/spgxlog.h
+++ b/src/include/access/spgxlog.h
@@ -237,6 +237,7 @@ typedef struct spgxlogVacuumRoot
typedef struct spgxlogVacuumRedirect
{
+ bool onCatalogTable;
uint16 nToPlaceholder; /* number of redirects to make placeholders */
OffsetNumber firstPlaceholder; /* first placeholder tuple to remove */
TransactionId newestRedirectXid; /* newest XID of removed redirects */
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index a8f1d66..4e0776a 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -205,4 +205,6 @@ extern void CheckPointReplicationSlots(void);
extern void CheckSlotRequirements(void);
+extern void ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid);
+
#endif /* SLOT_H */
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 2361243..f276c7e 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -28,7 +28,7 @@ extern void InitRecoveryTransactionEnvironment(void);
extern void ShutdownRecoveryTransactionEnvironment(void);
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
- RelFileNode node);
+ bool catalogTable, RelFileNode node);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index 9606d02..78bc639 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -131,6 +131,7 @@ extern char get_rel_relkind(Oid relid);
extern bool get_rel_relispartition(Oid relid);
extern Oid get_rel_tablespace(Oid relid);
extern char get_rel_persistence(Oid relid);
+extern bool get_rel_logical_catalog(Oid relid);
extern Oid get_transform_fromsql(Oid typid, Oid langid, List *trftypes);
extern Oid get_transform_tosql(Oid typid, Oid langid, List *trftypes);
extern bool get_typisdefined(Oid typid);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 89a7fbf..c36e228 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -16,6 +16,7 @@
#include "access/tupdesc.h"
#include "access/xlog.h"
+#include "catalog/catalog.h"
#include "catalog/pg_class.h"
#include "catalog/pg_index.h"
#include "catalog/pg_publication.h"
diff --git a/src/test/recovery/t/016_logical_decoding_on_replica.pl b/src/test/recovery/t/016_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..9ee79b0
--- /dev/null
+++ b/src/test/recovery/t/016_logical_decoding_on_replica.pl
@@ -0,0 +1,391 @@
+# Demonstrate that logical can follow timeline switches.
+#
+# Test logical decoding on a standby.
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 51;
+use RecursiveCopy;
+use File::Copy;
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--slot=decoding_standby');
+
+# Fetch xmin columns from slot's pg_replication_slots row, after waiting for
+# given boolean condition to be true to ensure we've reached a quiescent state
+sub wait_for_phys_mins
+{
+ my ($node, $slotname, $check_expr) = @_;
+
+ $node->poll_query_until(
+ 'postgres', qq[
+ SELECT $check_expr
+ FROM pg_catalog.pg_replication_slots
+ WHERE slot_name = '$slotname';
+ ]) or die "Timed out waiting for slot xmins to advance";
+
+ my $slotinfo = $node->slot($slotname);
+ return ($slotinfo->{'xmin'}, $slotinfo->{'catalog_xmin'});
+}
+
+sub print_phys_xmin
+{
+ my $slot = $node_master->slot('decoding_standby');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# After slot creation, xmins must be null
+is($xmin, '', "xmin null");
+is($catalog_xmin, '', "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+ $node_master, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_replica->append_conf('postgresql.conf',
+ q[primary_slot_name = 'decoding_standby']);
+
+$node_replica->start;
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# with hot_standby_feedback off, xmin and catalog_xmin must still be null
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "xmin null after replica join");
+is($catalog_xmin, '', "catalog_xmin null after replica join");
+
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. With hot_standby_feedback on, xmin should advance,
+# but catalog_xmin should still remain NULL since there is no logical slot.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NULL");
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+#
+# This must succeed since we know we have a catalog_xmin reservation. We
+# might've already sent hot standby feedback to advance our physical slot's
+# catalog_xmin but not received the corresponding xlog for the catalog xmin
+# advance, in which case we'll create a slot that isn't usable. The calling
+# application can prevent this by creating a temporary slot on the master to
+# lock in its catalog_xmin. For a truly race-free solution we'd need
+# master-to-standby hot_standby_feedback replies.
+#
+# In this case it won't race because there's no concurrent activity on the
+# master.
+#
+is($node_replica->psql('testdb', qq[SELECT * FROM pg_create_logical_replication_slot('standby_logical', 'test_decoding')]),
+ 0, 'logical slot creation on standby succeeded')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+sub print_logical_xmin
+{
+ my $slot = $node_replica->slot('standby_logical');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
+$node_master->safe_psql('testdb', 'DROP TABLE test_table');
+$node_master->safe_psql('testdb', 'VACUUM');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+# Should show the inserts even when the table is dropped on master
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($stderr, '', 'stderr is empty');
+is($ret, 0, 'replay from slot succeeded')
+ or BAIL_OUT('cannot continue if slot replay fails');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_1();]);
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+for my $i (0 .. 2000)
+{
+ $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_2();]);
+$node_master->safe_psql('testdb', 'VACUUM');
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+cmp_ok($new_logical_catalog_xmin, "==", $logical_catalog_xmin, "logical slot catalog_xmin hasn't advanced before get_changes");
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+isnt($stdout, '', 'replayed some rows');
+
+($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+is($new_logical_xmin, '', "logical xmin null");
+isnt($new_logical_catalog_xmin, '', "logical slot catalog_xmin not null");
+cmp_ok($new_logical_catalog_xmin, ">", $logical_catalog_xmin, "logical slot catalog_xmin advanced after get_changes");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+isnt($new_physical_xmin, '', "physical xmin not null");
+# hot standby feedback should advance phys catalog_xmin now that the standby's
+# slot doesn't hold it down as far.
+isnt($new_physical_catalog_xmin, '', "physical catalog_xmin not null");
+cmp_ok($new_physical_catalog_xmin, ">", $physical_catalog_xmin, "physical catalog_xmin advanced");
+
+cmp_ok($new_physical_catalog_xmin, "<=", $new_logical_catalog_xmin, 'upstream physical slot catalog_xmin not past downstream catalog_xmin with hs_feedback on');
+
+#########################################################
+# Upstream oldestXid retention
+#########################################################
+
+sub test_oldest_xid_retention()
+{
+ # First burn some xids on the master in another DB, so we push the master's
+ # nextXid ahead.
+ foreach my $i (1 .. 100)
+ {
+ $node_master->safe_psql('postgres', 'SELECT txid_current()');
+ }
+
+ # Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
+ # past our needed xmin. The only way we have visibility into that is to force
+ # a checkpoint.
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
+ foreach my $dbname ('template1', 'postgres', 'testdb', 'template0')
+ {
+ $node_master->safe_psql($dbname, 'VACUUM FREEZE');
+ }
+ sleep(1);
+ $node_master->safe_psql('postgres', 'CHECKPOINT');
+ IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
+ or die "pg_controldata failed with $?";
+ my @checkpoint = split('\n', $stdout);
+ my ($oldestXid, $nextXid) = ('', '');
+ foreach my $line (@checkpoint)
+ {
+ if ($line =~ qr/^Latest checkpoint's NextXID:\s+\d+:(\d+)/)
+ {
+ $nextXid = $1;
+ }
+ if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
+ {
+ $oldestXid = $1;
+ }
+ }
+ die 'no oldestXID found in checkpoint' unless $oldestXid;
+
+ my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+ my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+
+ print "upstream oldestXid $oldestXid, nextXid $nextXid, phys slot catalog_xmin $new_physical_catalog_xmin, downstream catalog_xmin $new_logical_catalog_xmin";
+
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
+
+ return ($oldestXid);
+}
+
+my ($oldestXid) = test_oldest_xid_retention();
+
+cmp_ok($oldestXid, "<=", $new_logical_catalog_xmin, 'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
+
+########################################################################
+# Recovery conflict: conflicting replication slot should get dropped
+########################################################################
+
+# One way to reproduce recovery conflict is to run VACUUM FULL with
+# hot_standby_feedback turned off on slave.
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = off
+]);
+$node_replica->restart;
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. Both should be NULL since hs_feedback is off
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NULL AND catalog_xmin IS NULL");
+$node_master->safe_psql('testdb', 'VACUUM FULL');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+isnt($ret, 0, 'usage of slot failed as expected');
+like($stderr, qr/does not exist/, 'slot not found as expected');
+
+# Re-create the slot now that we know it is dropped
+is($node_replica->psql('testdb', qq[SELECT * FROM pg_create_logical_replication_slot('standby_logical', 'test_decoding')]),
+ 0, 'logical slot creation on standby succeeded')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+# Set hot_standby_feedback back on
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. Both should be non-NULL since hs_feedback is on and
+# there is a logical slot present on standby.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NOT NULL");
+
+##################################################
+# Drop slot
+##################################################
+#
+is($node_replica->safe_psql('postgres', 'SHOW hot_standby_feedback'), 'on', 'hs_feedback is on');
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+
+# Make sure slots on replicas are droppable, and properly clear the upstream's xmin
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
+
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. catalog_xmin should become NULL because we dropped
+# the logical slot.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NULL");
+
+##################################################
+# Recovery: drop database drops idle slots
+##################################################
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB on the upstream if they're on the right DB, or not dropped if on
+# another DB.
+
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-P', 'test_decoding', '-S', 'dodropslot', '--create-slot'], 'pg_recvlogical created dodropslot');
+$node_replica->command_ok(['pg_recvlogical', '-v', '-d', $node_replica->connstr('postgres'), '-P', 'test_decoding', '-S', 'otherslot', '--create-slot'], 'pg_recvlogical created otherslot');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, 'logical', 'slot dodropslot on standby created');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'slot otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+ 'database dropped on standby');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, '', 'slot on standby dropped');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby not dropped');
+
+
+##################################################
+# Recovery: drop database drops in-use slots
+##################################################
+
+# This time, have the slot in-use on the downstream DB when we drop it.
+print "Testing dropdb when downstream slot is in-use";
+$node_master->psql('postgres', q[CREATE DATABASE testdb2]);
+
+print "creating slot dodropslot2";
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-P', 'test_decoding', '-S', 'dodropslot2', '--create-slot'],
+ 'pg_recvlogical created slot test_decoding');
+is($node_replica->slot('dodropslot2')->{'slot_type'}, 'logical', 'slot dodropslot2 on standby created');
+
+# make sure the slot is in use
+print "starting pg_recvlogical";
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+sleep(1);
+
+is($node_replica->slot('dodropslot2')->{'active'}, 't', 'slot on standby is active')
+ or BAIL_OUT("slot not active on standby, cannot continue. pg_recvlogical exited with '$stdout', '$stderr'");
+
+# Master doesn't know the replica's slot is busy so dropdb should succeed
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb2]);
+ok(1, 'dropdb finished');
+
+while ($node_replica->slot('dodropslot2')->{'active_pid'})
+{
+ sleep(1);
+ print "waiting for walsender to exit";
+}
+
+print "walsender exited, waiting for pg_recvlogical to exit";
+
+# our client should've terminated in response to the walsender error
+eval {
+ $handle->finish;
+};
+$return = $?;
+if ($return) {
+ is($return, 256, "pg_recvlogical terminated by server");
+ like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict');
+ like($stderr, qr/User was connected to a database that must be dropped./, 'recvlogical recovery conflict db');
+}
+
+is($node_replica->slot('dodropslot2')->{'active_pid'}, '', 'walsender backend exited');
+
+# The slot should be dropped by recovery now
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb2')]), 'f',
+ 'database dropped on standby');
+
+is($node_replica->slot('dodropslot2')->{'slot_type'}, '', 'slot on standby dropped');
--
2.1.4
On 03/13/2019 08:40 PM, tushar wrote:
Hi ,
I am getting a server crash on standby while executing
pg_logical_slot_get_changes function , please refer to this scenario
Master cluster( ./initdb -D master)
set wal_level='hot_standby' in master/postgresql.conf file
start the server , connect to psql terminal and create a physical
replication slot ( SELECT * from
pg_create_physical_replication_slot('p1');)
perform pg_basebackup using --slot 'p1' (./pg_basebackup -D slave/ -R
--slot p1 -v))
set wal_level='logical' , hot_standby_feedback=on,
primary_slot_name='p1' in slave/postgresql.conf file
start the server , connect to psql terminal and create a logical
replication slot ( SELECT * from
pg_create_logical_replication_slot('t','test_decoding');)
run pgbench ( ./pgbench -i -s 10 postgres) on master and select
pg_logical_slot_get_changes on Slave database
postgres=# select * from pg_logical_slot_get_changes('t',null,null);
2019-03-13 20:34:50.274 IST [26817] LOG: starting logical decoding
for slot "t"
2019-03-13 20:34:50.274 IST [26817] DETAIL: Streaming transactions
committing after 0/6C000060, reading WAL from 0/6C000028.
2019-03-13 20:34:50.274 IST [26817] STATEMENT: select * from
pg_logical_slot_get_changes('t',null,null);
2019-03-13 20:34:50.275 IST [26817] LOG: logical decoding found
consistent point at 0/6C000028
2019-03-13 20:34:50.275 IST [26817] DETAIL: There are no running
transactions.
2019-03-13 20:34:50.275 IST [26817] STATEMENT: select * from
pg_logical_slot_get_changes('t',null,null);
TRAP: FailedAssertion("!(data == tupledata + tuplelen)", File:
"decode.c", Line: 977)
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: 2019-03-13
20:34:50.276 IST [26809] LOG: server process (PID 26817) was
terminated by signal 6: Aborted
Andres - do you think this is an issue that needs to be fixed?
--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
Hi,
On 2019-04-10 12:11:21 +0530, tushar wrote:
Andres - do you think this is an issue that needs to be fixed?
Yes, it definitely needs to be fixed. I just haven't had sufficient time
to look into it. Have you reproduced this with Amit's latest version?
Amit, have you spent any time looking into it? I know that you're not
that deeply steeped into the internals of logical decoding, but perhaps
there's something obvious going on.
Greetings,
Andres Freund
On 04/10/2019 09:39 PM, Andres Freund wrote:
Have you reproduced this with Amit's latest version?
Yes - it is very much reproducible.
--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
On Wed, 10 Apr 2019 at 21:39, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-04-10 12:11:21 +0530, tushar wrote:
Andres - do you think this is an issue that needs to be fixed?
Yes, it definitely needs to be fixed. I just haven't had sufficient time
to look into it. Have you reproduced this with Amit's latest version?
Amit, have you spent any time looking into it? I know that you're not
that deeply steeped into the internals of logical decoding, but perhaps
there's something obvious going on.
I tried to see if I can quickly understand what's going on.
Here, master wal_level is hot_standby, not logical, though slave
wal_level is logical.
On slave, when pg_logical_slot_get_changes() is run, in
DecodeMultiInsert(), it does not get any WAL records having
XLH_INSERT_CONTAINS_NEW_TUPLE set. So data pointer is never
incremented, it remains at tupledata. So at the end of the function,
this assertion fails :
Assert(data == tupledata + tuplelen);
because data is actually at tupledata.
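To make that concrete, the relevant part of DecodeMultiInsert() looks
roughly like this (a simplified paraphrase for illustration, not the
exact source):
/* simplified paraphrase of the tuple-walking loop in DecodeMultiInsert() */
char	   *data = tupledata;		/* start of block 0's block data */
for (i = 0; i < xlrec->ntuples; i++)
{
	/*
	 * The tuple payload is only walked when the record carries
	 * XLH_INSERT_CONTAINS_NEW_TUPLE, which is only set when the WAL was
	 * generated with wal_level=logical on the node that wrote it, i.e.
	 * the primary.  With a hot_standby/replica-level primary the flag
	 * is never set, so 'data' is never advanced.
	 */
	if (xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE)
	{
		xl_multi_insert_tuple *xlhdr = (xl_multi_insert_tuple *) SHORTALIGN(data);
		data = ((char *) xlhdr) + SizeOfMultiInsertTuple;
		data += xlhdr->datalen;		/* skip over the tuple contents */
	}
}
/* fails with data still equal to tupledata when the flag was never set */
Assert(data == tupledata + tuplelen);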
Not sure why this is happening. On slave, wal_level is logical, so
logical records should have tuple data. Not sure what that has to do
with the wal_level of master. Everything should be there on slave
after it replays the inserts; and also slave wal_level is logical.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Hi,
On 2019-04-12 23:34:02 +0530, Amit Khandekar wrote:
I tried to see if I can quickly understand what's going on.
Here, master wal_level is hot_standby, not logical, though slave
wal_level is logical.
Oh, that's well diagnosed. Cool. Also nicely tested - this'd be ugly
in production.
I assume the problem isn't present if you set the primary to wal_level =
logical?
Not sure why this is happening. On slave, wal_level is logical, so
logical records should have tuple data. Not sure what does that have
to do with wal_level of master. Everything should be there on slave
after it replays the inserts; and also slave wal_level is logical.
The standby doesn't write its own WAL, only primaries do. I thought we
forbade running with wal_level=logical on a standby, when the primary is
only set to replica. But that's not what we do, see
CheckRequiredParameterValues().
I've not yet thought this through, but I think we'll have to somehow
error out in this case. I guess we could just check at the start of
decoding what ControlFile->wal_level is set to, and then raise an error
in decode.c when we pass an XLOG_PARAMETER_CHANGE record that sets
wal_level to something lower?
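For illustration, something along these lines in decode.c's
DecodeXLogOp() might do (an untested sketch; xl_parameter_change
already carries the primary's wal_level):
		case XLOG_PARAMETER_CHANGE:
			{
				xl_parameter_change *xlrec =
					(xl_parameter_change *) XLogRecGetData(buf->record);
				/*
				 * If the primary dropped below wal_level=logical, the WAL
				 * from here on no longer contains the information logical
				 * decoding needs, so bail out instead of decoding garbage.
				 */
				if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
					ereport(ERROR,
							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
							 errmsg("logical decoding on a standby requires wal_level >= logical on the primary")));
				break;
			}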
Could you try to implement that?
Greetings,
Andres Freund
On Sat, 13 Apr 2019 at 00:57, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-04-12 23:34:02 +0530, Amit Khandekar wrote:
I tried to see if I can quickly understand what's going on.
Here, master wal_level is hot_standby, not logical, though slave
wal_level is logical.
Oh, that's well diagnosed. Cool. Also nicely tested - this'd be ugly
in production.
Tushar had made me aware of the fact that this reproduces only when
master wal_level is hot_standby.
I assume the problem isn't present if you set the primary to wal_level =
logical?
Right.
Not sure why this is happening. On slave, wal_level is logical, so
logical records should have tuple data. Not sure what does that have
to do with wal_level of master. Everything should be there on slave
after it replays the inserts; and also slave wal_level is logical.
The standby doesn't write its own WAL, only primaries do. I thought we
forbade running with wal_level=logical on a standby, when the primary is
only set to replica. But that's not what we do, see
CheckRequiredParameterValues().
I've not yet thought this through, but I think we'll have to somehow
error out in this case. I guess we could just check at the start of
decoding what ControlFile->wal_level is set to,
By "start of decoding", I didn't get where exactly. Do you mean
CheckLogicalDecodingRequirements() ?
and then raise an error
in decode.c when we pass an XLOG_PARAMETER_CHANGE record that sets
wal_level to something lower?
Didn't get where exactly we should error out. We don't do
XLOG_PARAMETER_CHANGE handling in decode.c , so obviously you meant
something else, which I didn't understand.
What I am thinking is :
In CheckLogicalDecodingRequirements(), besides checking wal_level,
also check ControlFile->wal_level when InHotStandby. I mean, when we
are InHotStandby, both wal_level and ControlFile->wal_level should be
>= WAL_LEVEL_LOGICAL. This will allow us to error out when using a logical
slot while the master has an incompatible wal_level.
ControlFile is not accessible outside xlog.c, so we would need an API to
extract this field.
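Something like the following accessor might be enough (a minimal
sketch; the function name is made up here, not an existing API):
/* hypothetical accessor in xlog.c; name and locking are illustrative */
WalLevel
GetControlFileWalLevel(void)
{
	WalLevel	wal_level;

	LWLockAcquire(ControlFileLock, LW_SHARED);
	wal_level = (WalLevel) ControlFile->wal_level;
	LWLockRelease(ControlFileLock);

	return wal_level;
}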
Could you try to implement that?
Greetings,
Andres Freund
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Hi,
Sorry for the late response.
On 2019-04-16 12:27:46 +0530, Amit Khandekar wrote:
On Sat, 13 Apr 2019 at 00:57, Andres Freund <andres@anarazel.de> wrote:
Not sure why this is happening. On slave, wal_level is logical, so
logical records should have tuple data. Not sure what does that have
to do with wal_level of master. Everything should be there on slave
after it replays the inserts; and also slave wal_level is logical.
The standby doesn't write its own WAL, only primaries do. I thought we
forbade running with wal_level=logical on a standby, when the primary is
only set to replica. But that's not what we do, see
CheckRequiredParameterValues().
I've not yet thought this through, but I think we'll have to somehow
error out in this case. I guess we could just check at the start of
decoding what ControlFile->wal_level is set to,
By "start of decoding", I didn't get where exactly. Do you mean
CheckLogicalDecodingRequirements() ?
Right.
and then raise an error
in decode.c when we pass an XLOG_PARAMETER_CHANGE record that sets
wal_level to something lower?
Didn't get where exactly we should error out. We don't do
XLOG_PARAMETER_CHANGE handling in decode.c , so obviously you meant
something else, which I didn't understand.
I was indeed thinking of checking XLOG_PARAMETER_CHANGE in
decode.c. Adding handling for that, and just checking wal_level, ought
to be fairly doable? But, see below:
What I am thinking is :
In CheckLogicalDecodingRequirements(), besides checking wal_level,
also check ControlFile->wal_level when InHotStandby. I mean, when we
are InHotStandby, both wal_level and ControlFile->wal_level should be
>= WAL_LEVEL_LOGICAL. This will allow us to error out when using a logical
slot while the master has an incompatible wal_level.
That still allows the primary to change wal_level after logical decoding
has started, so we need the additional checks.
I'm not yet sure how to best deal with the fact that wal_level might be
changed by the primary at basically all times. We would eventually get
an error when logical decoding reaches the XLOG_PARAMETER_CHANGE. But
that's not necessarily sufficient - if a primary changes its wal_level
to lower, it could remove information logical decoding needs *before*
logical decoding reaches the XLOG_PARAMETER_CHANGE record.
So I suspect we need conflict handling in xlog_redo's
XLOG_PARAMETER_CHANGE case. If we there check against existing logical
slots, we ought to be safe.
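Roughly, that redo-side check could look like this (an untested sketch;
ResolveRecoveryConflictWithLogicalSlots() is a hypothetical helper that
would drop or invalidate all logical slots on the standby, along the
lines of the draft patch's ResolveRecoveryConflictWithSlots()):
/* inside xlog_redo()'s XLOG_PARAMETER_CHANGE branch -- untested sketch */
xl_parameter_change xlrec;

memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));

/*
 * The primary stopped logging what logical decoding needs; every
 * logical slot on this standby is conflicting now.  The helper is
 * hypothetical (see the note above); the current draft only has the
 * xid-based ResolveRecoveryConflictWithSlots().
 */
if (InHotStandby && xlrec.wal_level < WAL_LEVEL_LOGICAL)
	ResolveRecoveryConflictWithLogicalSlots();

/* ... followed by the existing ControlFile updates */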
Therefore I think the check in CheckLogicalDecodingRequirements() needs
to be something like:
if (RecoveryInProgress())
{
if (!InHotStandby)
ereport(ERROR, "logical decoding on a standby required hot_standby to be enabled");
/*
* This check is racy, but whenever XLOG_PARAMETER_CHANGE indicates that
* wal_level has changed, we verify that there are no existin glogical
* replication slots. And to avoid races around creating a new slot,
* CheckLogicalDecodingRequirements() is called once before creating the slot,
* andd once when logical decoding is initially starting up.
*/
if (ControlFile->wal_level != LOGICAL)
ereport(ERROR, "...");
}
And then add a second CheckLogicalDecodingRequirements() call into
CreateInitDecodingContext().
What do you think?
Greetings,
Andres Freund
Hi,
I am going through your comments. Meanwhile, attached is a rebased
version of the v4 patch.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
logical-decoding-on-standby_v4_rebased.patch
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 503db34..385ea1f 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -195,7 +195,8 @@ gistRedoDeleteRecord(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
@@ -397,7 +398,7 @@ gistRedoPageReuse(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
- xlrec->node);
+ xlrec->onCatalogTable, xlrec->node);
}
}
@@ -589,6 +590,7 @@ gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXi
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index d7b7098..00c3e0f 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -1002,7 +1002,8 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
RelFileNode rnode;
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer);
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index e17f017..b67e4e6 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
#include "access/hash.h"
#include "access/hash_xlog.h"
+#include "catalog/catalog.h"
#include "miscadmin.h"
#include "utils/rel.h"
#include "storage/lwlock.h"
@@ -398,6 +399,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
xl_hash_vacuum_one_page xlrec;
XLogRecPtr recptr;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(hrel);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.ntuples = ndeletable;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 19d2c52..7a15b35 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7117,12 +7117,13 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
* see comments for vacuum_log_cleanup_info().
*/
XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
+log_heap_cleanup_info(Relation rel, TransactionId latestRemovedXid)
{
xl_heap_cleanup_info xlrec;
XLogRecPtr recptr;
- xlrec.node = rnode;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
+ xlrec.node = rel->rd_node;
xlrec.latestRemovedXid = latestRemovedXid;
XLogBeginInsert();
@@ -7158,6 +7159,7 @@ log_heap_clean(Relation reln, Buffer buffer,
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
xlrec.ndead = ndead;
@@ -7208,6 +7210,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.cutoff_xid = cutoff_xid;
xlrec.ntuples = ntuples;
@@ -7238,7 +7241,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
* heap_buffer, if necessary.
*/
XLogRecPtr
-log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
+log_heap_visible(Relation rel, Buffer heap_buffer, Buffer vm_buffer,
TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
@@ -7248,6 +7251,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(heap_buffer));
Assert(BufferIsValid(vm_buffer));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec.cutoff_xid = cutoff_xid;
xlrec.flags = vmflags;
XLogBeginInsert();
@@ -7668,7 +7672,8 @@ heap_xlog_cleanup_info(XLogReaderState *record)
xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, xlrec->node);
/*
* Actual operation is a no-op. Record type exists to provide a means for
@@ -7704,7 +7709,8 @@ heap_xlog_clean(XLogReaderState *record)
* latestRemovedXid is invalid, skip conflict processing.
*/
if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
/*
* If we have a full-page image, restore it (using a cleanup lock) and
@@ -7800,7 +7806,9 @@ heap_xlog_visible(XLogReaderState *record)
* rather than killing the transaction outright.
*/
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid,
+ xlrec->onCatalogTable,
+ rnode);
/*
* Read the heap page, if it still exists. If the heap file has dropped or
@@ -7937,7 +7945,8 @@ heap_xlog_freeze_page(XLogReaderState *record)
TransactionIdRetreat(latestRemovedXid);
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9e17acc1..a8b73e4 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -473,7 +473,7 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
* No need to write the record at all unless it contains a valid value
*/
if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
- (void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+ (void) log_heap_cleanup_info(rel, vacrelstats->latestRemovedXid);
}
/*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06..c5fdd64 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -281,7 +281,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
if (XLogRecPtrIsInvalid(recptr))
{
Assert(!InRecovery);
- recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
+ recptr = log_heap_visible(rel, heapBuf, vmBuf,
cutoff_xid, flags);
/*
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index e7c40cb..75a6c24 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -31,6 +31,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/lsyscache.h"
#include "utils/snapmgr.h"
static void _bt_cachemetadata(Relation rel, BTMetaPageData *input);
@@ -773,6 +774,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
@@ -1140,6 +1142,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
XLogRecPtr recptr;
xl_btree_delete xlrec_delete;
+ xlrec_delete.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_delete.latestRemovedXid = latestRemovedXid;
xlrec_delete.nitems = nitems;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 6532a25..b874bda 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -526,7 +526,8 @@ btree_xlog_delete(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
/*
@@ -810,6 +811,7 @@ btree_xlog_reuse_page(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable,
xlrec->node);
}
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index fc85c6f..ca750e6 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -27,6 +27,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
+#include "utils/lsyscache.h"
/* Entry in pending-list of TIDs we need to revisit */
@@ -502,6 +503,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
OffsetNumber itemnos[MaxIndexTuplesPerPage];
spgxlogVacuumRedirect xlrec;
+ xlrec.onCatalogTable = get_rel_logical_catalog(index->rd_index->indrelid);
xlrec.nToPlaceholder = 0;
xlrec.newestRedirectXid = InvalidTransactionId;
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index ebe6ae8..800609c 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -881,6 +881,7 @@ spgRedoVacuumRedirect(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &node, NULL, NULL);
ResolveRecoveryConflictWithSnapshot(xldata->newestRedirectXid,
+ xldata->onCatalogTable,
node);
}
}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index acb4d9a..31951bd 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -94,6 +94,7 @@ CheckLogicalDecodingRequirements(void)
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires a database connection")));
+#ifdef NOT_ANYMORE
/* ----
* TODO: We got to change that someday soon...
*
@@ -111,6 +112,7 @@ CheckLogicalDecodingRequirements(void)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("logical decoding cannot be used while in recovery")));
+#endif
}
/*
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 55c306e..1bc7a3c 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1064,6 +1064,85 @@ ReplicationSlotReserveWal(void)
}
}
+void
+ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
+{
+ int i;
+ bool found_conflict = false;
+
+ if (max_replication_slots <= 0)
+ return;
+
+restart:
+ if (found_conflict)
+ {
+ CHECK_FOR_INTERRUPTS();
+ /*
+ * Wait awhile for them to die so that we avoid flooding an
+ * unresponsive backend when system is heavily loaded.
+ */
+ pg_usleep(100000);
+ found_conflict = false;
+ }
+
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationSlot *s;
+ NameData slotname;
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;
+
+ s = &ReplicationSlotCtl->replication_slots[i];
+
+ /* cannot change while ReplicationSlotCtlLock is held */
+ if (!s->in_use)
+ continue;
+
+ /* not our database, skip */
+ if (s->data.database != InvalidOid && s->data.database != dboid)
+ continue;
+
+ SpinLockAcquire(&s->mutex);
+ slotname = s->data.name;
+ slot_xmin = s->data.xmin;
+ slot_catalog_xmin = s->data.catalog_xmin;
+ SpinLockRelease(&s->mutex);
+
+ if (TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(WARNING,
+ (errmsg("slot %s w/ xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_xmin, xid)));
+ }
+
+ if (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(WARNING,
+ (errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_catalog_xmin, xid)));
+ }
+
+
+ if (found_conflict)
+ {
+ elog(WARNING, "Dropping conflicting slot %s", s->data.name.data);
+ LWLockRelease(ReplicationSlotControlLock); /* avoid deadlock */
+ ReplicationSlotDropPtr(s);
+
+ /* We released the lock above; so re-scan the slots. */
+ goto restart;
+ }
+ }
+
+ LWLockRelease(ReplicationSlotControlLock);
+}
+
+
/*
* Flush all replication slots to disk.
*
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 215f146..75dbdb9 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -23,6 +23,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/slot.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -291,7 +292,8 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
}
void
-ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode node)
+ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
+ bool onCatalogTable, RelFileNode node)
{
VirtualTransactionId *backends;
@@ -312,6 +314,9 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode
ResolveRecoveryConflictWithVirtualXIDs(backends,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+
+ if (onCatalogTable)
+ ResolveRecoveryConflictWithSlots(node.dbNode, latestRemovedXid);
}
void
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index b4f2d0f..f4da4bc 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -18,7 +18,9 @@
#include "access/hash.h"
#include "access/htup_details.h"
#include "access/nbtree.h"
+#include "access/table.h"
#include "bootstrap/bootstrap.h"
+#include "catalog/catalog.h"
#include "catalog/namespace.h"
#include "catalog/pg_am.h"
#include "catalog/pg_amop.h"
@@ -1893,6 +1895,20 @@ get_rel_persistence(Oid relid)
return result;
}
+bool
+get_rel_logical_catalog(Oid relid)
+{
+ bool res;
+ Relation rel;
+
+ /* assume previously locked */
+ rel = heap_open(relid, NoLock);
+ res = RelationIsAccessibleInLogicalDecoding(rel);
+ heap_close(rel, NoLock);
+
+ return res;
+}
+
/* ---------- TRANSFORM CACHE ---------- */
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index e66b034..61ca0e8 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -47,6 +47,7 @@ typedef struct gistxlogPageUpdate
*/
typedef struct gistxlogDelete
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 ntodelete; /* number of deleted offsets */
@@ -94,6 +95,7 @@ typedef struct gistxlogPageDelete
*/
typedef struct gistxlogPageReuse
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 53b682c..fd70b55 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -263,6 +263,7 @@ typedef struct xl_hash_init_bitmap_page
*/
typedef struct xl_hash_vacuum_one_page
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int ntuples;
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 22cd13c..482c874 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -237,6 +237,7 @@ typedef struct xl_heap_update
*/
typedef struct xl_heap_clean
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 nredirected;
uint16 ndead;
@@ -252,6 +253,7 @@ typedef struct xl_heap_clean
*/
typedef struct xl_heap_cleanup_info
{
+ bool onCatalogTable;
RelFileNode node;
TransactionId latestRemovedXid;
} xl_heap_cleanup_info;
@@ -332,6 +334,7 @@ typedef struct xl_heap_freeze_tuple
*/
typedef struct xl_heap_freeze_page
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint16 ntuples;
} xl_heap_freeze_page;
@@ -346,6 +349,7 @@ typedef struct xl_heap_freeze_page
*/
typedef struct xl_heap_visible
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint8 flags;
} xl_heap_visible;
@@ -395,7 +399,7 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
extern const char *heap2_identify(uint8 info);
extern void heap_xlog_logical_rewrite(XLogReaderState *r);
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
+extern XLogRecPtr log_heap_cleanup_info(Relation rel,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
@@ -414,7 +418,7 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
bool *totally_frozen);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
-extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
+extern XLogRecPtr log_heap_visible(Relation rel, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc8..f64a33c 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -126,6 +126,7 @@ typedef struct xl_btree_split
*/
typedef struct xl_btree_delete
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int nitems;
@@ -139,6 +140,7 @@ typedef struct xl_btree_delete
*/
typedef struct xl_btree_reuse_page
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/spgxlog.h b/src/include/access/spgxlog.h
index ee8fc6f..d535441 100644
--- a/src/include/access/spgxlog.h
+++ b/src/include/access/spgxlog.h
@@ -237,6 +237,7 @@ typedef struct spgxlogVacuumRoot
typedef struct spgxlogVacuumRedirect
{
+ bool onCatalogTable;
uint16 nToPlaceholder; /* number of redirects to make placeholders */
OffsetNumber firstPlaceholder; /* first placeholder tuple to remove */
TransactionId newestRedirectXid; /* newest XID of removed redirects */
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index a8f1d66..4e0776a 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -205,4 +205,6 @@ extern void CheckPointReplicationSlots(void);
extern void CheckSlotRequirements(void);
+extern void ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid);
+
#endif /* SLOT_H */
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 2361243..f276c7e 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -28,7 +28,7 @@ extern void InitRecoveryTransactionEnvironment(void);
extern void ShutdownRecoveryTransactionEnvironment(void);
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
- RelFileNode node);
+ bool catalogTable, RelFileNode node);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index 9606d02..78bc639 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -131,6 +131,7 @@ extern char get_rel_relkind(Oid relid);
extern bool get_rel_relispartition(Oid relid);
extern Oid get_rel_tablespace(Oid relid);
extern char get_rel_persistence(Oid relid);
+extern bool get_rel_logical_catalog(Oid relid);
extern Oid get_transform_fromsql(Oid typid, Oid langid, List *trftypes);
extern Oid get_transform_tosql(Oid typid, Oid langid, List *trftypes);
extern bool get_typisdefined(Oid typid);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d7f33ab..8c90fd7 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -16,6 +16,7 @@
#include "access/tupdesc.h"
#include "access/xlog.h"
+#include "catalog/catalog.h"
#include "catalog/pg_class.h"
#include "catalog/pg_index.h"
#include "catalog/pg_publication.h"
diff --git a/src/test/recovery/t/016_logical_decoding_on_replica.pl b/src/test/recovery/t/016_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..9ee79b0
--- /dev/null
+++ b/src/test/recovery/t/016_logical_decoding_on_replica.pl
@@ -0,0 +1,391 @@
+# Demonstrate that logical decoding can follow timeline switches.
+#
+# Test logical decoding on a standby.
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 51;
+use RecursiveCopy;
+use File::Copy;
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--slot=decoding_standby');
+
+# Fetch xmin columns from slot's pg_replication_slots row, after waiting for
+# given boolean condition to be true to ensure we've reached a quiescent state
+sub wait_for_phys_mins
+{
+ my ($node, $slotname, $check_expr) = @_;
+
+ $node->poll_query_until(
+ 'postgres', qq[
+ SELECT $check_expr
+ FROM pg_catalog.pg_replication_slots
+ WHERE slot_name = '$slotname';
+ ]) or die "Timed out waiting for slot xmins to advance";
+
+ my $slotinfo = $node->slot($slotname);
+ return ($slotinfo->{'xmin'}, $slotinfo->{'catalog_xmin'});
+}
+
+sub print_phys_xmin
+{
+ my $slot = $node_master->slot('decoding_standby');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# After slot creation, xmins must be null
+is($xmin, '', "xmin null");
+is($catalog_xmin, '', "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+ $node_master, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_replica->append_conf('postgresql.conf',
+ q[primary_slot_name = 'decoding_standby']);
+
+$node_replica->start;
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# with hot_standby_feedback off, xmin and catalog_xmin must still be null
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "xmin null after replica join");
+is($catalog_xmin, '', "catalog_xmin null after replica join");
+
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. With hot_standby_feedback on, xmin should advance,
+# but catalog_xmin should still remain NULL since there is no logical slot.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NULL");
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+#
+# This must succeed since we know we have a catalog_xmin reservation. We
+# might've already sent hot standby feedback to advance our physical slot's
+# catalog_xmin but not received the corresponding xlog for the catalog xmin
+# advance, in which case we'll create a slot that isn't usable. The calling
+# application can prevent this by creating a temporary slot on the master to
+# lock in its catalog_xmin. For a truly race-free solution we'd need
+# master-to-standby hot_standby_feedback replies.
+#
+# In this case it won't race because there's no concurrent activity on the
+# master.
+#
+is($node_replica->psql('testdb', qq[SELECT * FROM pg_create_logical_replication_slot('standby_logical', 'test_decoding')]),
+ 0, 'logical slot creation on standby succeeded')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+sub print_logical_xmin
+{
+ my $slot = $node_replica->slot('standby_logical');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
+$node_master->safe_psql('testdb', 'DROP TABLE test_table');
+$node_master->safe_psql('testdb', 'VACUUM');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+# Should show the inserts even when the table is dropped on master
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($stderr, '', 'stderr is empty');
+is($ret, 0, 'replay from slot succeeded')
+ or BAIL_OUT('cannot continue if slot replay fails');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_1();]);
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+for my $i (0 .. 2000)
+{
+ $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_2();]);
+$node_master->safe_psql('testdb', 'VACUUM');
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+cmp_ok($new_logical_catalog_xmin, "==", $logical_catalog_xmin, "logical slot catalog_xmin hasn't advanced before get_changes");
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+isnt($stdout, '', 'replayed some rows');
+
+($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+is($new_logical_xmin, '', "logical xmin null");
+isnt($new_logical_catalog_xmin, '', "logical slot catalog_xmin not null");
+cmp_ok($new_logical_catalog_xmin, ">", $logical_catalog_xmin, "logical slot catalog_xmin advanced after get_changes");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+isnt($new_physical_xmin, '', "physical xmin not null");
+# hot standby feedback should advance phys catalog_xmin now that the standby's
+# slot doesn't hold it down as far.
+isnt($new_physical_catalog_xmin, '', "physical catalog_xmin not null");
+cmp_ok($new_physical_catalog_xmin, ">", $physical_catalog_xmin, "physical catalog_xmin advanced");
+
+cmp_ok($new_physical_catalog_xmin, "<=", $new_logical_catalog_xmin, 'upstream physical slot catalog_xmin not past downstream catalog_xmin with hs_feedback on');
+
+#########################################################
+# Upstream oldestXid retention
+#########################################################
+
+sub test_oldest_xid_retention()
+{
+ # First burn some xids on the master in another DB, so we push the master's
+ # nextXid ahead.
+ foreach my $i (1 .. 100)
+ {
+ $node_master->safe_psql('postgres', 'SELECT txid_current()');
+ }
+
+ # Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
+ # past our needed xmin. The only way we have visibility into that is to force
+ # a checkpoint.
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
+ foreach my $dbname ('template1', 'postgres', 'testdb', 'template0')
+ {
+ $node_master->safe_psql($dbname, 'VACUUM FREEZE');
+ }
+ sleep(1);
+ $node_master->safe_psql('postgres', 'CHECKPOINT');
+ IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
+ or die "pg_controldata failed with $?";
+ my @checkpoint = split('\n', $stdout);
+ my ($oldestXid, $nextXid) = ('', '');
+ foreach my $line (@checkpoint)
+ {
+ if ($line =~ qr/^Latest checkpoint's NextXID:\s+\d+:(\d+)/)
+ {
+ $nextXid = $1;
+ }
+ if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
+ {
+ $oldestXid = $1;
+ }
+ }
+ die 'no oldestXID found in checkpoint' unless $oldestXid;
+
+ my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+ my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+
+ print "upstream oldestXid $oldestXid, nextXid $nextXid, phys slot catalog_xmin $new_physical_catalog_xmin, downstream catalog_xmin $new_logical_catalog_xmin";
+
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
+
+ return ($oldestXid);
+}
+
+my ($oldestXid) = test_oldest_xid_retention();
+
+cmp_ok($oldestXid, "<=", $new_logical_catalog_xmin, 'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
+
+########################################################################
+# Recovery conflict: conflicting replication slot should get dropped
+########################################################################
+
+# One way to reproduce recovery conflict is to run VACUUM FULL with
+# hot_standby_feedback turned off on slave.
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = off
+]);
+$node_replica->restart;
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. Both should be NULL since hs_feedback is off
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NULL AND catalog_xmin IS NULL");
+$node_master->safe_psql('testdb', 'VACUUM FULL');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+isnt($ret, 0, 'usage of slot failed as expected');
+like($stderr, qr/does not exist/, 'slot not found as expected');
+
+# Re-create the slot now that we know it is dropped
+is($node_replica->psql('testdb', qq[SELECT * FROM pg_create_logical_replication_slot('standby_logical', 'test_decoding')]),
+ 0, 'logical slot creation on standby succeeded')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+# Set hot_standby_feedback back on
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. Both should be non-NULL since hs_feedback is on and
+# there is a logical slot present on standby.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NOT NULL");
+
+##################################################
+# Drop slot
+##################################################
+#
+is($node_replica->safe_psql('postgres', 'SHOW hot_standby_feedback'), 'on', 'hs_feedback is on');
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+
+# Make sure slots on replicas are droppable, and properly clear the upstream's xmin
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
+
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. catalog_xmin should become NULL because we dropped
+# the logical slot.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NULL");
+
+##################################################
+# Recovery: drop database drops idle slots
+##################################################
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB on the upstream if they're on the right DB, or not dropped if on
+# another DB.
+
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-P', 'test_decoding', '-S', 'dodropslot', '--create-slot'], 'pg_recvlogical created dodropslot');
+$node_replica->command_ok(['pg_recvlogical', '-v', '-d', $node_replica->connstr('postgres'), '-P', 'test_decoding', '-S', 'otherslot', '--create-slot'], 'pg_recvlogical created otherslot');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, 'logical', 'slot dodropslot on standby created');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'slot otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+ 'database dropped on standby');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, '', 'slot on standby dropped');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby not dropped');
+
+
+##################################################
+# Recovery: drop database drops in-use slots
+##################################################
+
+# This time, have the slot in-use on the downstream DB when we drop it.
+print "Testing dropdb when downstream slot is in-use";
+$node_master->psql('postgres', q[CREATE DATABASE testdb2]);
+
+print "creating slot dodropslot2";
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-P', 'test_decoding', '-S', 'dodropslot2', '--create-slot'],
+ 'pg_recvlogical created slot dodropslot2');
+is($node_replica->slot('dodropslot2')->{'slot_type'}, 'logical', 'slot dodropslot2 on standby created');
+
+# make sure the slot is in use
+print "starting pg_recvlogical";
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+sleep(1);
+
+is($node_replica->slot('dodropslot2')->{'active'}, 't', 'slot on standby is active')
+ or BAIL_OUT("slot not active on standby, cannot continue. pg_recvlogical exited with '$stdout', '$stderr'");
+
+# Master doesn't know the replica's slot is busy so dropdb should succeed
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb2]);
+ok(1, 'dropdb finished');
+
+while ($node_replica->slot('dodropslot2')->{'active_pid'})
+{
+ sleep(1);
+ print "waiting for walsender to exit";
+}
+
+print "walsender exited, waiting for pg_recvlogical to exit";
+
+# our client should've terminated in response to the walsender error
+eval {
+ $handle->finish;
+};
+$return = $?;
+if ($return) {
+ is($return, 256, "pg_recvlogical terminated by server");
+ like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict');
+ like($stderr, qr/User was connected to a database that must be dropped./, 'recvlogical recovery conflict db');
+}
+
+is($node_replica->slot('dodropslot2')->{'active_pid'}, '', 'walsender backend exited');
+
+# The slot should be dropped by recovery now
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb2')]), 'f',
+ 'database dropped on standby');
+
+is($node_replica->slot('dodropslot2')->{'slot_type'}, '', 'slot on standby dropped');
On Tue, 9 Apr 2019 at 22:23, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Sat, 6 Apr 2019 at 04:45, Andres Freund <andres@anarazel.de> wrote:
I think this should be refactored so that the two found_conflict cases
set a 'reason' variable (perhaps an enum?) to the particular reason, and
then only one warning should be emitted. I also think that LOG might be
more appropriate than WARNING - as confusing as that is, LOG is more
severe than WARNING (see docs about log_min_messages).
What I have in mind is :
ereport(LOG,
(errcode(ERRCODE_INTERNAL_ERROR),
errmsg("Dropping conflicting slot %s", s->data.name.data),
errdetail("%s, removed xid %d.", conflict_str, xid)));
where conflict_str is a dynamically generated string containing
something like : "slot xmin : 1234, slot catalog_xmin: 5678"
So for the user, the errdetail will look like :
"slot xmin: 1234, catalog_xmin: 5678, removed xid : 9012"
I think the user can figure out whether it was xmin or catalog_xmin or
both that conflicted with removed xid.
If we don't do this way, we may not be able to show in a single
message if both xmin and catalog_xmin are conflicting at the same
time.Does this message look good to you, or you had in mind something quite
different ?
This is yet another point that needs to be settled. Until then, I will
use the above format to display the error message in the upcoming patch
version.
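
For illustration, a minimal sketch of how such a single message could be
assembled, assuming a helper called from the per-slot loop of
ResolveRecoveryConflictWithSlots() shown earlier; the helper name and the
StringInfo-based conflict_str construction are assumptions, not code from
the posted patches:

    /*
     * Sketch only: emit one LOG message for a conflicting slot, with the
     * conflicting horizons collected into a single errdetail string.
     * Assumes slotname/xmin values were read under the slot's spinlock,
     * as in ResolveRecoveryConflictWithSlots() above.
     */
    static void
    report_conflicting_slot(NameData *slotname, TransactionId slot_xmin,
                            TransactionId slot_catalog_xmin, TransactionId xid)
    {
        StringInfoData conflict_str;

        initStringInfo(&conflict_str);
        if (TransactionIdIsValid(slot_xmin) &&
            TransactionIdPrecedesOrEquals(slot_xmin, xid))
            appendStringInfo(&conflict_str, "slot xmin: %u", slot_xmin);
        if (TransactionIdIsValid(slot_catalog_xmin) &&
            TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
            appendStringInfo(&conflict_str, "%sslot catalog_xmin: %u",
                             conflict_str.len > 0 ? ", " : "",
                             slot_catalog_xmin);

        ereport(LOG,
                (errcode(ERRCODE_INTERNAL_ERROR),
                 errmsg("Dropping conflicting slot %s", NameStr(*slotname)),
                 errdetail("%s, removed xid %u.", conflict_str.data, xid)));

        pfree(conflict_str.data);
    }

The caller would then set found_conflict and drop the slot exactly as the
existing loop does.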
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Tue, 21 May 2019 at 21:49, Andres Freund <andres@anarazel.de> wrote:
Hi,
Sorry for the late response.
On 2019-04-16 12:27:46 +0530, Amit Khandekar wrote:
On Sat, 13 Apr 2019 at 00:57, Andres Freund <andres@anarazel.de> wrote:
Not sure why this is happening. On slave, wal_level is logical, so
logical records should have tuple data. Not sure what does that have
to do with wal_level of master. Everything should be there on slave
after it replays the inserts; and also slave wal_level is logical.

The standby doesn't write its own WAL, only primaries do. I thought we
forbade running with wal_level=logical on a standby, when the primary is
only set to replica. But that's not what we do, see
CheckRequiredParameterValues().

I've not yet thought this through, but I think we'll have to somehow
error out in this case. I guess we could just check at the start of
decoding what ControlFile->wal_level is set to,

By "start of decoding", I didn't get where exactly. Do you mean
CheckLogicalDecodingRequirements()?

Right.

and then raise an error
in decode.c when we pass an XLOG_PARAMETER_CHANGE record that sets
wal_level to something lower?

Didn't get where exactly we should error out. We don't do
XLOG_PARAMETER_CHANGE handling in decode.c, so obviously you meant
something else, which I didn't understand.

I was indeed thinking of checking XLOG_PARAMETER_CHANGE in
decode.c. Adding handling for that, and just checking wal_level, ought
to be fairly doable? But, see below:

What I am thinking is:
In CheckLogicalDecodingRequirements(), besides checking wal_level,
also check ControlFile->wal_level when InHotStandby. I mean, when we
are InHotStandby, both wal_level and ControlFile->wal_level should be
>= WAL_LEVEL_LOGICAL. This will allow us to error out when using a
logical slot when master has incompatible wal_level.

That still allows the primary to change wal_level after logical decoding
has started, so we need the additional checks.

I'm not yet sure how to best deal with the fact that wal_level might be
changed by the primary at basically all times. We would eventually get
an error when logical decoding reaches the XLOG_PARAMETER_CHANGE. But
that's not necessarily sufficient - if a primary changes its wal_level
to lower, it could remove information logical decoding needs *before*
logical decoding reaches the XLOG_PARAMETER_CHANGE record.

So I suspect we need conflict handling in xlog_redo's
XLOG_PARAMETER_CHANGE case. If we there check against existing logical
slots, we ought to be safe.

Therefore I think the check in CheckLogicalDecodingRequirements() needs
to be something like:

    if (RecoveryInProgress())
    {
        if (!InHotStandby)
            ereport(ERROR, "logical decoding on a standby requires hot_standby to be enabled");

        /*
         * This check is racy, but whenever XLOG_PARAMETER_CHANGE indicates that
         * wal_level has changed, we verify that there are no existing logical
         * replication slots. And to avoid races around creating a new slot,
         * CheckLogicalDecodingRequirements() is called once before creating the slot,
         * and once when logical decoding is initially starting up.
         */
        if (ControlFile->wal_level != LOGICAL)
            ereport(ERROR, "...");
    }

And then add a second CheckLogicalDecodingRequirements() call into
CreateInitDecodingContext().

What do you think?
Yeah, I agree we should add such checks to minimize the possibility of
reading logical records from a master that has insufficient wal_level.
So to summarize:
a. CheckLogicalDecodingRequirements(): Add ControlFile wal_level checks
b. Call this function in CreateInitDecodingContext() as well.
c. While decoding XLOG_PARAMETER_CHANGE record, emit recovery conflict
error if there is an existing logical slot.
This made me think more of the race conditions. For instance, in
pg_create_logical_replication_slot(), just after
CheckLogicalDecodingRequirements and before actually creating the
slot, suppose concurrently Controlfile->wal_level is changed from
logical to replica. So suppose a new slot does get created. Later the
slot is read, so in pg_logical_slot_get_changes_guts(),
CheckLogicalDecodingRequirements() is called where it checks
ControlFile->wal_level value. But just before it does that,
ControlFile->wal_level concurrently changes back to logical, because
of replay of another param-change record. So this logical reader will
think that the wal_level is sufficient, and will proceed to read the
records, but those records are *before* the wal_level change, so these
records don't have logical data.
Do you think this is possible, or am I missing something? If that's
possible, I was considering some other mechanisms. For instance, whenever
a logical reader reads a wal_level-change record, save the value in
the ReplicationSlotPersistentData. So while reading the WAL records,
the reader knows whether the records have logical data; if they don't,
error out. But I am not sure how the reader would know the status of the
very first record, i.e. before it gets the wal_level-change record.
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Thu, May 23, 2019 at 8:08 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
This made me think more of the race conditions. For instance, in
pg_create_logical_replication_slot(), just after
CheckLogicalDecodingRequirements and before actually creating the
slot, suppose concurrently Controlfile->wal_level is changed from
logical to replica. So suppose a new slot does get created. Later the
slot is read, so in pg_logical_slot_get_changes_guts(),
CheckLogicalDecodingRequirements() is called where it checks
ControlFile->wal_level value. But just before it does that,
ControlFile->wal_level concurrently changes back to logical, because
of replay of another param-change record. So this logical reader will
think that the wal_level is sufficient, and will proceed to read the
records, but those records are *before* the wal_level change, so these
records don't have logical data.

Do you think this is possible, or am I missing something?
wal_level is PGC_POSTMASTER.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello
wal_level is PGC_POSTMASTER.
But the primary can be restarted without a restart of the standby. We require wal_level replica or higher (currently the only higher level is logical) on the standby. So an online change from logical to replica wal_level is possible in the standby's control file.
regards, Sergei
On Thu, May 23, 2019 at 9:30 AM Sergei Kornilov <sk@zsrv.org> wrote:
wal_level is PGC_POSTMASTER.
But the primary can be restarted without a restart of the standby. We require wal_level replica or higher (currently the only higher level is logical) on the standby. So an online change from logical to replica wal_level is possible in the standby's control file.
That's true, but Amit's scenario involved a change in wal_level during
the execution of pg_create_logical_replication_slot(), which I think
can't happen.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2019-05-23 17:39:21 +0530, Amit Khandekar wrote:
On Tue, 21 May 2019 at 21:49, Andres Freund <andres@anarazel.de> wrote:
Yeah, I agree we should add such checks to minimize the possibility of
reading logical records from a master that has insufficient wal_level.
So to summarize :
a. CheckLogicalDecodingRequirements() : Add Controlfile wal_level checks
b. Call this function call in CreateInitDecodingContext() as well.
c. While decoding XLOG_PARAMETER_CHANGE record, emit recovery conflict
error if there is an existing logical slot.

This made me think more of the race conditions. For instance, in
pg_create_logical_replication_slot(), just after
CheckLogicalDecodingRequirements and before actually creating the
slot, suppose concurrently Controlfile->wal_level is changed from
logical to replica. So suppose a new slot does get created. Later the
slot is read, so in pg_logical_slot_get_changes_guts(),
CheckLogicalDecodingRequirements() is called where it checks
ControlFile->wal_level value. But just before it does that,
ControlFile->wal_level concurrently changes back to logical, because
of replay of another param-change record. So this logical reader will
think that the wal_level is sufficient, and will proceed to read the
records, but those records are *before* the wal_level change, so these
records don't have logical data.
I don't think that's an actual problem, because there's no decoding
before the slot exists and CreateInitDecodingContext() has determined
the start LSN. And by that point the slot exists, slo
XLOG_PARAMETER_CHANGE replay can error out.
Greetings,
Andres Freund
Hi,
On 2019-05-23 09:37:50 -0400, Robert Haas wrote:
On Thu, May 23, 2019 at 9:30 AM Sergei Kornilov <sk@zsrv.org> wrote:
wal_level is PGC_POSTMASTER.
But the primary can be restarted without a restart of the standby. We require wal_level replica or higher (currently the only higher level is logical) on the standby. So an online change from logical to replica wal_level is possible in the standby's control file.
That's true, but Amit's scenario involved a change in wal_level during
the execution of pg_create_logical_replication_slot(), which I think
can't happen.
I don't see why not - we're talking about the wal_level in the WAL
stream, not the setting on the standby. And that can change during the
execution of pg_create_logical_replication_slot(), if a PARAMETER_CHANGE
record is replayed. I don't think it's actually a problem, as I
outlined in my response to Amit, though.
Greetings,
Andres Freund
On Thu, 23 May 2019 at 21:29, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-05-23 17:39:21 +0530, Amit Khandekar wrote:
On Tue, 21 May 2019 at 21:49, Andres Freund <andres@anarazel.de> wrote:
Yeah, I agree we should add such checks to minimize the possibility of
reading logical records from a master that has insufficient wal_level.
So to summarize :
a. CheckLogicalDecodingRequirements() : Add Controlfile wal_level checks
b. Call this function call in CreateInitDecodingContext() as well.
c. While decoding XLOG_PARAMETER_CHANGE record, emit recovery conflict
error if there is an existing logical slot.

This made me think more of the race conditions. For instance, in
pg_create_logical_replication_slot(), just after
CheckLogicalDecodingRequirements and before actually creating the
slot, suppose concurrently Controlfile->wal_level is changed from
logical to replica. So suppose a new slot does get created. Later the
slot is read, so in pg_logical_slot_get_changes_guts(),
CheckLogicalDecodingRequirements() is called where it checks
ControlFile->wal_level value. But just before it does that,
ControlFile->wal_level concurrently changes back to logical, because
of replay of another param-change record. So this logical reader will
think that the wal_level is sufficient, and will proceed to read the
records, but those records are *before* the wal_level change, so these
records don't have logical data.

I don't think that's an actual problem, because there's no decoding
before the slot exists and CreateInitDecodingContext() has determined
the start LSN. And by that point the slot exists, slo
XLOG_PARAMETER_CHANGE replay can error out.
So between the start lsn and the lsn for
parameter-change(logical=>replica) record, there can be some records,
and these don't have logical data. So the slot created will read from
the start lsn, and proceed to read these records, before reading the
parameter-change record.
Can you re-write the below phrase please? I suspect there are some
letters missing there:
"And by that point the slot exists, slo XLOG_PARAMETER_CHANGE replay
can error out"
Are you saying we want to error out when postgres replays the
param-change record and there is an existing logical slot? I thought you
were suggesting earlier that it's the decode.c code which should
error out when reading the param-change record.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Hi,
On 2019-05-23 23:08:55 +0530, Amit Khandekar wrote:
On Thu, 23 May 2019 at 21:29, Andres Freund <andres@anarazel.de> wrote:
On 2019-05-23 17:39:21 +0530, Amit Khandekar wrote:
On Tue, 21 May 2019 at 21:49, Andres Freund <andres@anarazel.de> wrote:
Yeah, I agree we should add such checks to minimize the possibility of
reading logical records from a master that has insufficient wal_level.
So to summarize :
a. CheckLogicalDecodingRequirements() : Add Controlfile wal_level checks
b. Call this function call in CreateInitDecodingContext() as well.
c. While decoding XLOG_PARAMETER_CHANGE record, emit recovery conflict
error if there is an existing logical slot.

This made me think more of the race conditions. For instance, in
pg_create_logical_replication_slot(), just after
CheckLogicalDecodingRequirements and before actually creating the
slot, suppose concurrently Controlfile->wal_level is changed from
logical to replica. So suppose a new slot does get created. Later the
slot is read, so in pg_logical_slot_get_changes_guts(),
CheckLogicalDecodingRequirements() is called where it checks
ControlFile->wal_level value. But just before it does that,
ControlFile->wal_level concurrently changes back to logical, because
of replay of another param-change record. So this logical reader will
think that the wal_level is sufficient, and will proceed to read the
records, but those records are *before* the wal_level change, so these
records don't have logical data.I don't think that's an actual problem, because there's no decoding
before the slot exists and CreateInitDecodingContext() has determined
the start LSN. And by that point the slot exists, slo
XLOG_PARAMETER_CHANGE replay can error out.

So between the start lsn and the lsn for
parameter-change(logical=>replica) record, there can be some records,
and these don't have logical data. So the slot created will read from
the start lsn, and proceed to read these records, before reading the
parameter-change record.

I don't think that's possible. By the time CreateInitDecodingContext()
is called, the slot *already* exists (but in a state that'll cause it to
be thrown away on error). But the restart point has not yet been
determined. Thus, if there is a XLOG_PARAMETER_CHANGE with a wal_level
change it can error out. And to handle the race of wal_level changing
between CheckLogicalDecodingRequirements() and the slot creation, we
recheck in CreateInitDecodingContext().

Think we might need to change ReplicationSlotReserveWal() to use the
replay, rather than the redo pointer for logical slots though.
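
As a rough sketch of the kind of change hinted at here - only one
possible shape, assuming GetXLogReplayRecPtr() is a suitable source for
the restart position on a standby; this is not code from the posted
patches:

    /* inside ReplicationSlotReserveWal(); slot = MyReplicationSlot */
    if (RecoveryInProgress() && SlotIsLogical(slot))
    {
        /*
         * On a standby we cannot log a standby snapshot, and the redo
         * pointer of the last restartpoint may lag well behind; start
         * from what has actually been replayed instead.
         */
        restart_lsn = GetXLogReplayRecPtr(NULL);
    }
    else
        restart_lsn = GetRedoRecPtr();

    SpinLockAcquire(&slot->mutex);
    slot->data.restart_lsn = restart_lsn;
    SpinLockRelease(&slot->mutex);

Whether the replay pointer is in fact the right reservation point on a
standby is exactly the open question raised above.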
Can you re-write the below phrase please? I suspect there are some
letters missing there:
"And by that point the slot exists, slo XLOG_PARAMETER_CHANGE replay
can error out"
I think it's just one additional letter, namely s/slo/so/
Are you saying we want to error out when postgres replays the
param-change record and there is an existing logical slot? I thought you
were suggesting earlier that it's the decode.c code which should
error out when reading the param-change record.
Yes, that's what I'm saying. See this portion of my previous email on
the topic:
On 2019-05-21 09:19:37 -0700, Andres Freund wrote:
On 2019-04-16 12:27:46 +0530, Amit Khandekar wrote:
What I am thinking is:
In CheckLogicalDecodingRequirements(), besides checking wal_level,
also check ControlFile->wal_level when InHotStandby. I mean, when we
are InHotStandby, both wal_level and ControlFile->wal_level should be
>= WAL_LEVEL_LOGICAL. This will allow us to error out when using a
logical slot when master has incompatible wal_level.

That still allows the primary to change wal_level after logical decoding
has started, so we need the additional checks.

I'm not yet sure how to best deal with the fact that wal_level might be
changed by the primary at basically all times. We would eventually get
an error when logical decoding reaches the XLOG_PARAMETER_CHANGE. But
that's not necessarily sufficient - if a primary changes its wal_level
to lower, it could remove information logical decoding needs *before*
logical decoding reaches the XLOG_PARAMETER_CHANGE record.

So I suspect we need conflict handling in xlog_redo's
XLOG_PARAMETER_CHANGE case. If we there check against existing logical
slots, we ought to be safe.

Therefore I think the check in CheckLogicalDecodingRequirements() needs
to be something like:

    if (RecoveryInProgress())
    {
        if (!InHotStandby)
            ereport(ERROR, "logical decoding on a standby requires hot_standby to be enabled");

        /*
         * This check is racy, but whenever XLOG_PARAMETER_CHANGE indicates that
         * wal_level has changed, we verify that there are no existing logical
         * replication slots. And to avoid races around creating a new slot,
         * CheckLogicalDecodingRequirements() is called once before creating the slot,
         * and once when logical decoding is initially starting up.
         */
        if (ControlFile->wal_level != LOGICAL)
            ereport(ERROR, "...");
    }

And then add a second CheckLogicalDecodingRequirements() call into
CreateInitDecodingContext().

What do you think?
Greetings,
Andres Freund
On Thu, 23 May 2019 at 23:18, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-05-23 23:08:55 +0530, Amit Khandekar wrote:
On Thu, 23 May 2019 at 21:29, Andres Freund <andres@anarazel.de> wrote:
On 2019-05-23 17:39:21 +0530, Amit Khandekar wrote:
On Tue, 21 May 2019 at 21:49, Andres Freund <andres@anarazel.de> wrote:
Yeah, I agree we should add such checks to minimize the possibility of
reading logical records from a master that has insufficient wal_level.
So to summarize :
a. CheckLogicalDecodingRequirements() : Add Controlfile wal_level checks
b. Call this function call in CreateInitDecodingContext() as well.
c. While decoding XLOG_PARAMETER_CHANGE record, emit recovery conflict
error if there is an existing logical slot.

This made me think more of the race conditions. For instance, in
pg_create_logical_replication_slot(), just after
CheckLogicalDecodingRequirements and before actually creating the
slot, suppose concurrently Controlfile->wal_level is changed from
logical to replica. So suppose a new slot does get created. Later the
slot is read, so in pg_logical_slot_get_changes_guts(),
CheckLogicalDecodingRequirements() is called where it checks
ControlFile->wal_level value. But just before it does that,
ControlFile->wal_level concurrently changes back to logical, because
of replay of another param-change record. So this logical reader will
think that the wal_level is sufficient, and will proceed to read the
records, but those records are *before* the wal_level change, so these
records don't have logical data.

I don't think that's an actual problem, because there's no decoding
before the slot exists and CreateInitDecodingContext() has determined
the start LSN. And by that point the slot exists, slo
XLOG_PARAMETER_CHANGE replay can error out.

So between the start lsn and the lsn for
parameter-change(logical=>replica) record, there can be some records,
and these don't have logical data. So the slot created will read from
the start lsn, and proceed to read these records, before reading the
parameter-change record.

I don't think that's possible. By the time CreateInitDecodingContext()
is called, the slot *already* exists (but in a state that'll cause it to
be thrown away on error). But the restart point has not yet been
determined. Thus, if there is a XLOG_PARAMETER_CHANGE with a wal_level
change it can error out. And to handle the race of wal_level changing
between CheckLogicalDecodingRequirements() and the slot creation, we
recheck in CreateInitDecodingContext().
ok, got it now. I was concerned that there might be some such cases
unhandled because we are not using locks to handle such concurrency
conditions. But as you have explained, the checks we are adding will
avoid this race condition.
Think we might need to change ReplicationSlotReserveWal() to use the
replay, rather than the redo pointer for logical slots though.
Not thought of this; will get back.
Working on the patch now ....
Are you saying we want to error out when postgres replays the
param-change record and there is an existing logical slot? I thought you
were suggesting earlier that it's the decode.c code which should
error out when reading the param-change record.

Yes, that's what I'm saying. See this portion of my previous email on
the topic:
Yeah, thanks for pointing that.
On 2019-05-21 09:19:37 -0700, Andres Freund wrote:
So I suspect we need conflict handling in xlog_redo's
XLOG_PARAMETER_CHANGE case. If we there check against existing logical
slots, we ought to be safe.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Fri, 24 May 2019 at 19:26, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
Working on the patch now ....
Attached is an incremental WIP patch
handle_wal_level_changes_WIP.patch to be applied over the earlier main
patch logical-decoding-on-standby_v4_rebased.patch.
On 2019-05-21 09:19:37 -0700, Andres Freund wrote:
So I suspect we need conflict handling in xlog_redo's
XLOG_PARAMETER_CHANGE case. If we there check against existing logical
slots, we ought to be safe.
Yet to do this. Andres, how do you want to handle this scenario? Just
drop all the existing logical slots, as we decided for conflict
recovery for conflicting xids?
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
handle_wal_level_changes_WIP.patch (application/octet-stream)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 527522f..b26a20a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4928,6 +4928,15 @@ LocalProcessControlFile(bool reset)
}
/*
+ * Get the wal_level from the control file.
+ */
+int
+ControlFileWalLevel(void)
+{
+ return ControlFile->wal_level;
+}
+
+/*
* Initialization of shared memory for XLOG
*/
Size
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index eec3a22..2c638e9 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
* can restart from there.
*/
break;
+ case XLOG_PARAMETER_CHANGE:
+ {
+ xl_parameter_change *xlrec =
+ (xl_parameter_change *) XLogRecGetData(buf->record);
+
+ /* Cannot proceed if master itself does not have logical data */
+ if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ break;
+ }
case XLOG_NOOP:
case XLOG_NEXTOID:
case XLOG_SWITCH:
case XLOG_BACKUP_END:
- case XLOG_PARAMETER_CHANGE:
case XLOG_RESTORE_POINT:
case XLOG_FPW_CHANGE:
case XLOG_FPI_FOR_HINT:
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 31951bd..aab2f747 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -94,6 +94,23 @@ CheckLogicalDecodingRequirements(void)
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires a database connection")));
+ if (RecoveryInProgress())
+ {
+ /*
+ * This check may have race conditions, but whenever
+ * XLOG_PARAMETER_CHANGE indicates that wal_level has changed, we
+ * verify that there are no existing logical replication slots. And to
+ * avoid races around creating a new slot,
+ * CheckLogicalDecodingRequirements() is called once before creating
+ * the slot, and once when logical decoding is initially starting up.
+ */
+ if (ControlFileWalLevel() < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ }
+
#ifdef NOT_ANYMORE
/* ----
* TODO: We got to change that someday soon...
@@ -243,6 +260,8 @@ CreateInitDecodingContext(char *plugin,
LogicalDecodingContext *ctx;
MemoryContext old_context;
+ CheckLogicalDecodingRequirements();
+
/* shorter lines... */
slot = MyReplicationSlot;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 2af938b..8280d39 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -299,6 +299,7 @@ extern Size XLOGShmemSize(void);
extern void XLOGShmemInit(void);
extern void BootStrapXLOG(void);
extern void LocalProcessControlFile(bool reset);
+extern int ControlFileWalLevel(void);
extern void StartupXLOG(void);
extern void ShutdownXLOG(int code, Datum arg);
extern void InitXLOGAccess(void);
On Fri, 24 May 2019 at 21:00, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Fri, 24 May 2019 at 19:26, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
Working on the patch now ....
Attached is an incremental WIP patch
handle_wal_level_changes_WIP.patch to be applied over the earlier main
patch logical-decoding-on-standby_v4_rebased.patch.
I found an issue with these changes: when we change the master wal_level
from logical to hot_standby, and again back to logical, and then
create a logical replication slot on the slave, it gets created; but when
I do pg_logical_slot_get_changes() with that slot, it seems to read
records *before* I created the logical slot, so it encounters the
parameter-change(logical=>hot_standby) record and returns an error as
per the patch, because now in DecodeXLogOp() I error out when
XLOG_PARAMETER_CHANGE is found:
@@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
* can restart from there.
*/
break;
+ case XLOG_PARAMETER_CHANGE:
+ {
+ xl_parameter_change *xlrec =
+ (xl_parameter_change *) XLogRecGetData(buf->record);
+
+ /* Cannot proceed if master itself does not have logical data */
+ if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ break;
+ }
I thought it wouldn't read records *before* the slot was created. Am I
missing something?
On 2019-05-21 09:19:37 -0700, Andres Freund wrote:
So I suspect we need conflict handling in xlog_redo's
XLOG_PARAMETER_CHANGE case. If we there check against existing logical
slots, we ought to be safe.

Yet to do this. Andres, how do you want to handle this scenario? Just
drop all the existing logical slots, as we decided for conflict
recovery for conflicting xids?
I went ahead and added handling that drops existing slots when we
encounter XLOG_PARAMETER_CHANGE in xlog_redo().
Attached is logical-decoding-on-standby_v5.patch, which contains all
the changes so far.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
logical-decoding-on-standby_v5.patch (application/octet-stream)
From 5c4dff8c936b4285031ba2c4241a8667d99805fa Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Mon, 27 May 2019 16:59:51 +0530
Subject: [PATCH] Logical decoding on standby.
Author : Andres Freund.
Besides the above main changes, patch includes following :
1. Handle slot conflict recovery by dropping the conflicting slots.
-Amit Khandekar.
2. test/recovery/t/016_logical_decoding_on_replica.pl added.
Original author : Craig Ringer. few changes/additions from Amit Khandekar.
3. Handle slot conflicts when master wal_level becomes less than logical.
-Amit Khandekar
---
src/backend/access/gist/gistxlog.c | 6 +-
src/backend/access/hash/hash_xlog.c | 3 +-
src/backend/access/hash/hashinsert.c | 2 +
src/backend/access/heap/heapam.c | 23 +-
src/backend/access/heap/vacuumlazy.c | 2 +-
src/backend/access/heap/visibilitymap.c | 2 +-
src/backend/access/nbtree/nbtpage.c | 3 +
src/backend/access/nbtree/nbtxlog.c | 4 +-
src/backend/access/spgist/spgvacuum.c | 2 +
src/backend/access/spgist/spgxlog.c | 1 +
src/backend/access/transam/xlog.c | 20 ++
src/backend/replication/logical/decode.c | 14 +-
src/backend/replication/logical/logical.c | 21 ++
src/backend/replication/slot.c | 93 +++++
src/backend/storage/ipc/standby.c | 7 +-
src/backend/utils/cache/lsyscache.c | 16 +
src/include/access/gistxlog.h | 3 +-
src/include/access/hash_xlog.h | 1 +
src/include/access/heapam_xlog.h | 8 +-
src/include/access/nbtxlog.h | 2 +
src/include/access/spgxlog.h | 1 +
src/include/access/xlog.h | 1 +
src/include/replication/slot.h | 2 +
src/include/storage/standby.h | 2 +-
src/include/utils/lsyscache.h | 1 +
src/include/utils/rel.h | 1 +
.../recovery/t/016_logical_decoding_on_replica.pl | 391 +++++++++++++++++++++
27 files changed, 613 insertions(+), 19 deletions(-)
create mode 100644 src/test/recovery/t/016_logical_decoding_on_replica.pl
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 503db34..385ea1f 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -195,7 +195,8 @@ gistRedoDeleteRecord(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
@@ -397,7 +398,7 @@ gistRedoPageReuse(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
- xlrec->node);
+ xlrec->onCatalogTable, xlrec->node);
}
}
@@ -589,6 +590,7 @@ gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXi
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index d7b7098..00c3e0f 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -1002,7 +1002,8 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
RelFileNode rnode;
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer);
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 5321762..e28465a 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
#include "access/hash.h"
#include "access/hash_xlog.h"
+#include "catalog/catalog.h"
#include "miscadmin.h"
#include "utils/rel.h"
#include "storage/lwlock.h"
@@ -398,6 +399,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
xl_hash_vacuum_one_page xlrec;
XLogRecPtr recptr;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(hrel);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.ntuples = ndeletable;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 419da87..4093281 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7117,12 +7117,13 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
* see comments for vacuum_log_cleanup_info().
*/
XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
+log_heap_cleanup_info(Relation rel, TransactionId latestRemovedXid)
{
xl_heap_cleanup_info xlrec;
XLogRecPtr recptr;
- xlrec.node = rnode;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
+ xlrec.node = rel->rd_node;
xlrec.latestRemovedXid = latestRemovedXid;
XLogBeginInsert();
@@ -7158,6 +7159,7 @@ log_heap_clean(Relation reln, Buffer buffer,
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
xlrec.ndead = ndead;
@@ -7208,6 +7210,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.cutoff_xid = cutoff_xid;
xlrec.ntuples = ntuples;
@@ -7238,7 +7241,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
* heap_buffer, if necessary.
*/
XLogRecPtr
-log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
+log_heap_visible(Relation rel, Buffer heap_buffer, Buffer vm_buffer,
TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
@@ -7248,6 +7251,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(heap_buffer));
Assert(BufferIsValid(vm_buffer));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec.cutoff_xid = cutoff_xid;
xlrec.flags = vmflags;
XLogBeginInsert();
@@ -7668,7 +7672,8 @@ heap_xlog_cleanup_info(XLogReaderState *record)
xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, xlrec->node);
/*
* Actual operation is a no-op. Record type exists to provide a means for
@@ -7704,7 +7709,8 @@ heap_xlog_clean(XLogReaderState *record)
* latestRemovedXid is invalid, skip conflict processing.
*/
if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
/*
* If we have a full-page image, restore it (using a cleanup lock) and
@@ -7800,7 +7806,9 @@ heap_xlog_visible(XLogReaderState *record)
* rather than killing the transaction outright.
*/
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid,
+ xlrec->onCatalogTable,
+ rnode);
/*
* Read the heap page, if it still exists. If the heap file has dropped or
@@ -7937,7 +7945,8 @@ heap_xlog_freeze_page(XLogReaderState *record)
TransactionIdRetreat(latestRemovedXid);
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a3c4a1d..bf34d3a 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -473,7 +473,7 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
* No need to write the record at all unless it contains a valid value
*/
if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
- (void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+ (void) log_heap_cleanup_info(rel, vacrelstats->latestRemovedXid);
}
/*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06..c5fdd64 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -281,7 +281,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
if (XLogRecPtrIsInvalid(recptr))
{
Assert(!InRecovery);
- recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
+ recptr = log_heap_visible(rel, heapBuf, vmBuf,
cutoff_xid, flags);
/*
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index de4d4ef..9b1231e 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -31,6 +31,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/lsyscache.h"
#include "utils/snapmgr.h"
static void _bt_cachemetadata(Relation rel, BTMetaPageData *input);
@@ -773,6 +774,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
@@ -1140,6 +1142,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
XLogRecPtr recptr;
xl_btree_delete xlrec_delete;
+ xlrec_delete.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_delete.latestRemovedXid = latestRemovedXid;
xlrec_delete.nitems = nitems;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 6532a25..b874bda 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -526,7 +526,8 @@ btree_xlog_delete(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
/*
@@ -810,6 +811,7 @@ btree_xlog_reuse_page(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable,
xlrec->node);
}
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 2b1662a..eaaf631 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -27,6 +27,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
+#include "utils/lsyscache.h"
/* Entry in pending-list of TIDs we need to revisit */
@@ -502,6 +503,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
OffsetNumber itemnos[MaxIndexTuplesPerPage];
spgxlogVacuumRedirect xlrec;
+ xlrec.onCatalogTable = get_rel_logical_catalog(index->rd_index->indrelid);
xlrec.nToPlaceholder = 0;
xlrec.newestRedirectXid = InvalidTransactionId;
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index ebe6ae8..800609c 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -881,6 +881,7 @@ spgRedoVacuumRedirect(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &node, NULL, NULL);
ResolveRecoveryConflictWithSnapshot(xldata->newestRedirectXid,
+ xldata->onCatalogTable,
node);
}
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1c7dd51..d5d0522 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4928,6 +4928,15 @@ LocalProcessControlFile(bool reset)
}
/*
+ * Get the wal_level from the control file.
+ */
+int
+ControlFileWalLevel(void)
+{
+ return ControlFile->wal_level;
+}
+
+/*
* Initialization of shared memory for XLOG
*/
Size
@@ -9845,6 +9854,17 @@ xlog_redo(XLogReaderState *record)
/* Update our copy of the parameters in pg_control */
memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
+ /*
+ * Drop logical slots if we are in hot standby and master does not have
+ * logical data. Don't bother to search for the slots if standby is
+ * running with wal_level lower than logical, because in that case,
+ * we would have disallowed creation of logical slots.
+ */
+ if (InRecovery && InHotStandby &&
+ xlrec.wal_level < WAL_LEVEL_LOGICAL &&
+ wal_level >= WAL_LEVEL_LOGICAL)
+ ResolveRecoveryConflictWithSlots(InvalidOid, InvalidTransactionId);
+
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->MaxConnections = xlrec.MaxConnections;
ControlFile->max_worker_processes = xlrec.max_worker_processes;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 151c3ef..c1bd028 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
* can restart from there.
*/
break;
+ case XLOG_PARAMETER_CHANGE:
+ {
+ xl_parameter_change *xlrec =
+ (xl_parameter_change *) XLogRecGetData(buf->record);
+
+ /* Cannot proceed if master itself does not have logical data */
+ if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ break;
+ }
case XLOG_NOOP:
case XLOG_NEXTOID:
case XLOG_SWITCH:
case XLOG_BACKUP_END:
- case XLOG_PARAMETER_CHANGE:
case XLOG_RESTORE_POINT:
case XLOG_FPW_CHANGE:
case XLOG_FPI_FOR_HINT:
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index bbd38c0..c0dd327 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -94,6 +94,24 @@ CheckLogicalDecodingRequirements(void)
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires a database connection")));
+ if (RecoveryInProgress())
+ {
+ /*
+ * This check may have race conditions, but whenever
+ * XLOG_PARAMETER_CHANGE indicates that wal_level has changed, we
+ * verify that there are no existing logical replication slots. And to
+ * avoid races around creating a new slot,
+ * CheckLogicalDecodingRequirements() is called once before creating
+ * the slot, and once when logical decoding is initially starting up.
+ */
+ if (ControlFileWalLevel() < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ }
+
+#ifdef NOT_ANYMORE
/* ----
* TODO: We got to change that someday soon...
*
@@ -111,6 +129,7 @@ CheckLogicalDecodingRequirements(void)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("logical decoding cannot be used while in recovery")));
+#endif
}
/*
@@ -241,6 +260,8 @@ CreateInitDecodingContext(char *plugin,
LogicalDecodingContext *ctx;
MemoryContext old_context;
+ CheckLogicalDecodingRequirements();
+
/* shorter lines... */
slot = MyReplicationSlot;
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 55c306e..9027f06 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1065,6 +1065,99 @@ ReplicationSlotReserveWal(void)
}
/*
+ * Resolve recovery conflicts with slots.
+ *
+ * When xid is valid, it means it's a removed-xid kind of conflict, so need to
+ * drop the appropriate slots whose xmin conflicts with removed xid.
+ * When xid is invalid, drop all logical slots. This is required when the
+ * master wal_level is set back to replica, so existing logical slots need to
+ * be dropped.
+ */
+void
+ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
+{
+ int i;
+ bool found_conflict = false;
+
+ if (max_replication_slots <= 0)
+ return;
+
+restart:
+ if (found_conflict)
+ {
+ CHECK_FOR_INTERRUPTS();
+ /*
+ * Wait awhile for them to die so that we avoid flooding an
+ * unresponsive backend when system is heavily loaded.
+ */
+ pg_usleep(100000);
+ found_conflict = false;
+ }
+
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationSlot *s;
+ NameData slotname;
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;
+
+ s = &ReplicationSlotCtl->replication_slots[i];
+
+ /* cannot change while ReplicationSlotCtlLock is held */
+ if (!s->in_use)
+ continue;
+
+ /* Invalid xid means caller is asking to drop all logical slots */
+ if (!TransactionIdIsValid(xid) && SlotIsLogical(s))
+ found_conflict = true;
+ else
+ {
+ /* not our database, skip */
+ if (s->data.database != InvalidOid && s->data.database != dboid)
+ continue;
+
+ SpinLockAcquire(&s->mutex);
+ slotname = s->data.name;
+ slot_xmin = s->data.xmin;
+ slot_catalog_xmin = s->data.catalog_xmin;
+ SpinLockRelease(&s->mutex);
+
+ if (TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(LOG,
+ (errmsg("slot %s w/ xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_xmin, xid)));
+ }
+
+ if (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(LOG,
+ (errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_catalog_xmin, xid)));
+ }
+
+ }
+ if (found_conflict)
+ {
+ elog(LOG, "Dropping conflicting slot %s", s->data.name.data);
+ LWLockRelease(ReplicationSlotControlLock); /* avoid deadlock */
+ ReplicationSlotDropPtr(s);
+
+ /* We released the lock above; so re-scan the slots. */
+ goto restart;
+ }
+ }
+
+ LWLockRelease(ReplicationSlotControlLock);
+}
+
+
+/*
* Flush all replication slots to disk.
*
* This needn't actually be part of a checkpoint, but it's a convenient
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 842fcab..dda6b4d 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -23,6 +23,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/slot.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -291,7 +292,8 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
}
void
-ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode node)
+ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
+ bool onCatalogTable, RelFileNode node)
{
VirtualTransactionId *backends;
@@ -312,6 +314,9 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode
ResolveRecoveryConflictWithVirtualXIDs(backends,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+
+ if (onCatalogTable)
+ ResolveRecoveryConflictWithSlots(node.dbNode, latestRemovedXid);
}
void
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index b4f2d0f..f4da4bc 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -18,7 +18,9 @@
#include "access/hash.h"
#include "access/htup_details.h"
#include "access/nbtree.h"
+#include "access/table.h"
#include "bootstrap/bootstrap.h"
+#include "catalog/catalog.h"
#include "catalog/namespace.h"
#include "catalog/pg_am.h"
#include "catalog/pg_amop.h"
@@ -1893,6 +1895,20 @@ get_rel_persistence(Oid relid)
return result;
}
+bool
+get_rel_logical_catalog(Oid relid)
+{
+ bool res;
+ Relation rel;
+
+ /* assume previously locked */
+ rel = heap_open(relid, NoLock);
+ res = RelationIsAccessibleInLogicalDecoding(rel);
+ heap_close(rel, NoLock);
+
+ return res;
+}
+
/* ---------- TRANSFORM CACHE ---------- */
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 969a537..59246c3 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -48,9 +48,9 @@ typedef struct gistxlogPageUpdate
*/
typedef struct gistxlogDelete
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 ntodelete; /* number of deleted offsets */
-
/*
* In payload of blk 0 : todelete OffsetNumbers
*/
@@ -96,6 +96,7 @@ typedef struct gistxlogPageDelete
*/
typedef struct gistxlogPageReuse
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 53b682c..fd70b55 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -263,6 +263,7 @@ typedef struct xl_hash_init_bitmap_page
*/
typedef struct xl_hash_vacuum_one_page
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int ntuples;
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index f6cdca8..a1d1f11 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -237,6 +237,7 @@ typedef struct xl_heap_update
*/
typedef struct xl_heap_clean
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 nredirected;
uint16 ndead;
@@ -252,6 +253,7 @@ typedef struct xl_heap_clean
*/
typedef struct xl_heap_cleanup_info
{
+ bool onCatalogTable;
RelFileNode node;
TransactionId latestRemovedXid;
} xl_heap_cleanup_info;
@@ -332,6 +334,7 @@ typedef struct xl_heap_freeze_tuple
*/
typedef struct xl_heap_freeze_page
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint16 ntuples;
} xl_heap_freeze_page;
@@ -346,6 +349,7 @@ typedef struct xl_heap_freeze_page
*/
typedef struct xl_heap_visible
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint8 flags;
} xl_heap_visible;
@@ -395,7 +399,7 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
extern const char *heap2_identify(uint8 info);
extern void heap_xlog_logical_rewrite(XLogReaderState *r);
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
+extern XLogRecPtr log_heap_cleanup_info(Relation rel,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
@@ -414,7 +418,7 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
bool *totally_frozen);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
-extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
+extern XLogRecPtr log_heap_visible(Relation rel, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc8..f64a33c 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -126,6 +126,7 @@ typedef struct xl_btree_split
*/
typedef struct xl_btree_delete
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int nitems;
@@ -139,6 +140,7 @@ typedef struct xl_btree_delete
*/
typedef struct xl_btree_reuse_page
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/spgxlog.h b/src/include/access/spgxlog.h
index 073f740..d3dad69 100644
--- a/src/include/access/spgxlog.h
+++ b/src/include/access/spgxlog.h
@@ -237,6 +237,7 @@ typedef struct spgxlogVacuumRoot
typedef struct spgxlogVacuumRedirect
{
+ bool onCatalogTable;
uint16 nToPlaceholder; /* number of redirects to make placeholders */
OffsetNumber firstPlaceholder; /* first placeholder tuple to remove */
TransactionId newestRedirectXid; /* newest XID of removed redirects */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 237f4e0..fa02728 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -299,6 +299,7 @@ extern Size XLOGShmemSize(void);
extern void XLOGShmemInit(void);
extern void BootStrapXLOG(void);
extern void LocalProcessControlFile(bool reset);
+extern int ControlFileWalLevel(void);
extern void StartupXLOG(void);
extern void ShutdownXLOG(int code, Datum arg);
extern void InitXLOGAccess(void);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 8bc7f52..522153a 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -205,4 +205,6 @@ extern void CheckPointReplicationSlots(void);
extern void CheckSlotRequirements(void);
+extern void ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid);
+
#endif /* SLOT_H */
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index a3f8f82..6dedebc 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -28,7 +28,7 @@ extern void InitRecoveryTransactionEnvironment(void);
extern void ShutdownRecoveryTransactionEnvironment(void);
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
- RelFileNode node);
+ bool onCatalogTable, RelFileNode node);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index c8df5bf..579d9ff 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -131,6 +131,7 @@ extern char get_rel_relkind(Oid relid);
extern bool get_rel_relispartition(Oid relid);
extern Oid get_rel_tablespace(Oid relid);
extern char get_rel_persistence(Oid relid);
+extern bool get_rel_logical_catalog(Oid relid);
extern Oid get_transform_fromsql(Oid typid, Oid langid, List *trftypes);
extern Oid get_transform_tosql(Oid typid, Oid langid, List *trftypes);
extern bool get_typisdefined(Oid typid);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d7f33ab..8c90fd7 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -16,6 +16,7 @@
#include "access/tupdesc.h"
#include "access/xlog.h"
+#include "catalog/catalog.h"
#include "catalog/pg_class.h"
#include "catalog/pg_index.h"
#include "catalog/pg_publication.h"
diff --git a/src/test/recovery/t/016_logical_decoding_on_replica.pl b/src/test/recovery/t/016_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..9ee79b0
--- /dev/null
+++ b/src/test/recovery/t/016_logical_decoding_on_replica.pl
@@ -0,0 +1,391 @@
+# Demonstrate that logical can follow timeline switches.
+#
+# Test logical decoding on a standby.
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 51;
+use RecursiveCopy;
+use File::Copy;
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--slot=decoding_standby');
+
+# Fetch xmin columns from slot's pg_replication_slots row, after waiting for
+# given boolean condition to be true to ensure we've reached a quiescent state
+sub wait_for_phys_mins
+{
+ my ($node, $slotname, $check_expr) = @_;
+
+ $node->poll_query_until(
+ 'postgres', qq[
+ SELECT $check_expr
+ FROM pg_catalog.pg_replication_slots
+ WHERE slot_name = '$slotname';
+ ]) or die "Timed out waiting for slot xmins to advance";
+
+ my $slotinfo = $node->slot($slotname);
+ return ($slotinfo->{'xmin'}, $slotinfo->{'catalog_xmin'});
+}
+
+sub print_phys_xmin
+{
+ my $slot = $node_master->slot('decoding_standby');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# After slot creation, xmins must be null
+is($xmin, '', "xmin null");
+is($catalog_xmin, '', "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+ $node_master, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_replica->append_conf('postgresql.conf',
+ q[primary_slot_name = 'decoding_standby']);
+
+$node_replica->start;
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# with hot_standby_feedback off, xmin and catalog_xmin must still be null
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "xmin null after replica join");
+is($catalog_xmin, '', "catalog_xmin null after replica join");
+
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. With hot_standby_feedback on, xmin should advance,
+# but catalog_xmin should still remain NULL since there is no logical slot.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NULL");
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+#
+# This must succeed since we know we have a catalog_xmin reservation. We
+# might've already sent hot standby feedback to advance our physical slot's
+# catalog_xmin but not received the corresponding xlog for the catalog xmin
+# advance, in which case we'll create a slot that isn't usable. The calling
+# application can prevent this by creating a temporary slot on the master to
+# lock in its catalog_xmin. For a truly race-free solution we'd need
+# master-to-standby hot_standby_feedback replies.
+#
+# In this case it won't race because there's no concurrent activity on the
+# master.
+#
+is($node_replica->psql('testdb', qq[SELECT * FROM pg_create_logical_replication_slot('standby_logical', 'test_decoding')]),
+ 0, 'logical slot creation on standby succeeded')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+sub print_logical_xmin
+{
+ my $slot = $node_replica->slot('standby_logical');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
+$node_master->safe_psql('testdb', 'DROP TABLE test_table');
+$node_master->safe_psql('testdb', 'VACUUM');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+# Should show the inserts even when the table is dropped on master
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($stderr, '', 'stderr is empty');
+is($ret, 0, 'replay from slot succeeded')
+ or BAIL_OUT('cannot continue if slot replay fails');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_1();]);
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+for my $i (0 .. 2000)
+{
+ $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_2();]);
+$node_master->safe_psql('testdb', 'VACUUM');
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+cmp_ok($new_logical_catalog_xmin, "==", $logical_catalog_xmin, "logical slot catalog_xmin hasn't advanced before get_changes");
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+isnt($stdout, '', 'replayed some rows');
+
+($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+is($new_logical_xmin, '', "logical xmin null");
+isnt($new_logical_catalog_xmin, '', "logical slot catalog_xmin not null");
+cmp_ok($new_logical_catalog_xmin, ">", $logical_catalog_xmin, "logical slot catalog_xmin advanced after get_changes");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+isnt($new_physical_xmin, '', "physical xmin not null");
+# hot standby feedback should advance phys catalog_xmin now that the standby's
+# slot doesn't hold it down as far.
+isnt($new_physical_catalog_xmin, '', "physical catalog_xmin not null");
+cmp_ok($new_physical_catalog_xmin, ">", $physical_catalog_xmin, "physical catalog_xmin advanced");
+
+cmp_ok($new_physical_catalog_xmin, "<=", $new_logical_catalog_xmin, 'upstream physical slot catalog_xmin not past downstream catalog_xmin with hs_feedback on');
+
+#########################################################
+# Upstream oldestXid retention
+#########################################################
+
+sub test_oldest_xid_retention()
+{
+ # First burn some xids on the master in another DB, so we push the master's
+ # nextXid ahead.
+ foreach my $i (1 .. 100)
+ {
+ $node_master->safe_psql('postgres', 'SELECT txid_current()');
+ }
+
+ # Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
+ # past our needed xmin. The only way we have visibility into that is to force
+ # a checkpoint.
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
+ foreach my $dbname ('template1', 'postgres', 'testdb', 'template0')
+ {
+ $node_master->safe_psql($dbname, 'VACUUM FREEZE');
+ }
+ sleep(1);
+ $node_master->safe_psql('postgres', 'CHECKPOINT');
+ IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
+ or die "pg_controldata failed with $?";
+ my @checkpoint = split('\n', $stdout);
+ my ($oldestXid, $nextXid) = ('', '');
+ foreach my $line (@checkpoint)
+ {
+ if ($line =~ qr/^Latest checkpoint's NextXID:\s+\d+:(\d+)/)
+ {
+ $nextXid = $1;
+ }
+ if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
+ {
+ $oldestXid = $1;
+ }
+ }
+ die 'no oldestXID found in checkpoint' unless $oldestXid;
+
+ my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+ my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+
+ print "upstream oldestXid $oldestXid, nextXid $nextXid, phys slot catalog_xmin $new_physical_catalog_xmin, downstream catalog_xmin $new_logical_catalog_xmin";
+
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
+
+ return ($oldestXid);
+}
+
+my ($oldestXid) = test_oldest_xid_retention();
+
+cmp_ok($oldestXid, "<=", $new_logical_catalog_xmin, 'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
+
+########################################################################
+# Recovery conflict: conflicting replication slot should get dropped
+########################################################################
+
+# One way to reproduce recovery conflict is to run VACUUM FULL with
+# hot_standby_feedback turned off on slave.
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = off
+]);
+$node_replica->restart;
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. Both should be NULL since hs_feedback is off
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NULL AND catalog_xmin IS NULL");
+$node_master->safe_psql('testdb', 'VACUUM FULL');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+isnt($ret, 0, 'usage of slot failed as expected');
+like($stderr, qr/does not exist/, 'slot not found as expected');
+
+# Re-create the slot now that we know it is dropped
+is($node_replica->psql('testdb', qq[SELECT * FROM pg_create_logical_replication_slot('standby_logical', 'test_decoding')]),
+ 0, 'logical slot creation on standby succeeded')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+# Set hot_standby_feedback back on
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. Both should be non-NULL since hs_feedback is on and
+# there is a logical slot present on standby.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NOT NULL");
+
+##################################################
+# Drop slot
+##################################################
+#
+is($node_replica->safe_psql('postgres', 'SHOW hot_standby_feedback'), 'on', 'hs_feedback is on');
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+
+# Make sure slots on replicas are droppable, and properly clear the upstream's xmin
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
+
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. catalog_xmin should become NULL because we dropped
+# the logical slot.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NULL");
+
+##################################################
+# Recovery: drop database drops idle slots
+##################################################
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB on the upstream if they're on the right DB, or not dropped if on
+# another DB.
+
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-P', 'test_decoding', '-S', 'dodropslot', '--create-slot'], 'pg_recvlogical created dodropslot');
+$node_replica->command_ok(['pg_recvlogical', '-v', '-d', $node_replica->connstr('postgres'), '-P', 'test_decoding', '-S', 'otherslot', '--create-slot'], 'pg_recvlogical created otherslot');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, 'logical', 'slot dodropslot on standby created');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'slot otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+ 'database dropped on standby');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, '', 'slot on standby dropped');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby not dropped');
+
+
+##################################################
+# Recovery: drop database drops in-use slots
+##################################################
+
+# This time, have the slot in-use on the downstream DB when we drop it.
+print "Testing dropdb when downstream slot is in-use";
+$node_master->psql('postgres', q[CREATE DATABASE testdb2]);
+
+print "creating slot dodropslot2";
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-P', 'test_decoding', '-S', 'dodropslot2', '--create-slot'],
+ 'pg_recvlogical created dodropslot2');
+is($node_replica->slot('dodropslot2')->{'slot_type'}, 'logical', 'slot dodropslot2 on standby created');
+
+# make sure the slot is in use
+print "starting pg_recvlogical";
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+sleep(1);
+
+is($node_replica->slot('dodropslot2')->{'active'}, 't', 'slot on standby is active')
+ or BAIL_OUT("slot not active on standby, cannot continue. pg_recvlogical exited with '$stdout', '$stderr'");
+
+# Master doesn't know the replica's slot is busy so dropdb should succeed
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb2]);
+ok(1, 'dropdb finished');
+
+while ($node_replica->slot('dodropslot2')->{'active_pid'})
+{
+ sleep(1);
+ print "waiting for walsender to exit";
+}
+
+print "walsender exited, waiting for pg_recvlogical to exit";
+
+# our client should've terminated in response to the walsender error
+eval {
+ $handle->finish;
+};
+$return = $?;
+if ($return) {
+ is($return, 256, "pg_recvlogical terminated by server");
+ like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict');
+ like($stderr, qr/User was connected to a database that must be dropped./, 'recvlogical recovery conflict db');
+}
+
+is($node_replica->slot('dodropslot2')->{'active_pid'}, '', 'walsender backend exited');
+
+# The slot should be dropped by recovery now
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb2')]), 'f',
+ 'database dropped on standby');
+
+is($node_replica->slot('dodropslot2')->{'slot_type'}, '', 'slot on standby dropped');
--
2.1.4
On 2019-05-27 17:04:44 +0530, Amit Khandekar wrote:
On Fri, 24 May 2019 at 21:00, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Fri, 24 May 2019 at 19:26, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
Working on the patch now ....
Attached is an incremental WIP patch
handle_wal_level_changes_WIP.patch to be applied over the earlier main
patch logical-decoding-on-standby_v4_rebased.patch.
I found an issue with these changes : When we change master wal_level
from logical to hot_standby, and again back to logical, and then
create a logical replication slot on slave, it gets created; but when
I do pg_logical_slot_get_changes() with that slot, it seems to read
records *before* I created the logical slot, so it encounters
parameter-change(logical=>hot_standby) record, so returns an error as
per the patch, because now in DecodeXLogOp() I error out when
XLOG_PARAMETER_CHANGE is found :
@@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 * can restart from there.
 */
 break;
+ case XLOG_PARAMETER_CHANGE:
+ {
+ xl_parameter_change *xlrec =
+ (xl_parameter_change *) XLogRecGetData(buf->record);
+
+ /* Cannot proceed if master itself does not have logical data */
+ if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ break;
+ }
I thought it won't read records *before* the slot was created. Am I
missing something ?
That's why I had mentioned that you'd need to adapt
ReplicationSlotReserveWal(), to use the replay LSN or such.
Greetings,
Andres Freund
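In TAP terms (matching the test above), the reported sequence could be sketched roughly as follows; the slot name 'pc_slot' and the reuse of the $node_master/$node_replica handles are assumptions, and 'replica' here stands for what the mail calls hot_standby:

# Flip the primary's wal_level down to replica and back to logical; each
# restart emits an XLOG_PARAMETER_CHANGE record that the standby replays.
$node_master->append_conf('postgresql.conf', q[wal_level = replica]);
$node_master->restart;
$node_master->append_conf('postgresql.conf', q[wal_level = logical]);
$node_master->restart;
$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));

# Slot creation on the standby succeeds ...
$node_replica->safe_psql('testdb',
	q[SELECT pg_create_logical_replication_slot('pc_slot', 'test_decoding')]);

# ... but decoding starts from before the slot existed and trips over the
# logical->replica parameter-change record, raising the new error.
my ($ret, $out, $err) = $node_replica->psql('testdb',
	q[SELECT data FROM pg_logical_slot_get_changes('pc_slot', NULL, NULL)]);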
On Mon, 27 May 2019 at 19:26, Andres Freund <andres@anarazel.de> wrote:
On 2019-05-27 17:04:44 +0530, Amit Khandekar wrote:
On Fri, 24 May 2019 at 21:00, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Fri, 24 May 2019 at 19:26, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
Working on the patch now ....
Attached is an incremental WIP patch
handle_wal_level_changes_WIP.patch to be applied over the earlier main
patch logical-decoding-on-standby_v4_rebased.patch.
I found an issue with these changes : When we change master wal_level
from logical to hot_standby, and again back to logical, and then
create a logical replication slot on slave, it gets created; but when
I do pg_logical_slot_get_changes() with that slot, it seems to read
records *before* I created the logical slot, so it encounters
parameter-change(logical=>hot_standby) record, so returns an error as
per the patch, because now in DecodeXLogOp() I error out when
XLOG_PARAMETER_CHANGE is found :
@@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 * can restart from there.
 */
 break;
+ case XLOG_PARAMETER_CHANGE:
+ {
+ xl_parameter_change *xlrec =
+ (xl_parameter_change *) XLogRecGetData(buf->record);
+
+ /* Cannot proceed if master itself does not have logical data */
+ if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ break;
+ }
I thought it won't read records *before* the slot was created. Am I
missing something ?
That's why I had mentioned that you'd need to adapt
ReplicationSlotReserveWal(), to use the replay LSN or such.
Yeah ok. I tried to do this :
@@ -1042,7 +1042,8 @@ ReplicationSlotReserveWal(void)
if (!RecoveryInProgress() && SlotIsLogical(slot))
{
....
}
else
{
- restart_lsn = GetRedoRecPtr();
+ restart_lsn = SlotIsLogical(slot) ?
+ GetXLogReplayRecPtr(&ThisTimeLineID) : GetRedoRecPtr();
But then when I do pg_create_logical_replication_slot(), it hangs in
DecodingContextFindStartpoint(), waiting to find new records
(XLogReadRecord).
Working on it ...
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Hi,
On 2019-05-30 19:46:26 +0530, Amit Khandekar wrote:
@@ -1042,7 +1042,8 @@ ReplicationSlotReserveWal(void)
 if (!RecoveryInProgress() && SlotIsLogical(slot))
 {
 ....
 }
 else
 {
- restart_lsn = GetRedoRecPtr();
+ restart_lsn = SlotIsLogical(slot) ?
+ GetXLogReplayRecPtr(&ThisTimeLineID) : GetRedoRecPtr();
But then when I do pg_create_logical_replication_slot(), it hangs in
DecodingContextFindStartpoint(), waiting to find new records
(XLogReadRecord).
But just till the primary has logged the necessary WAL records? If you
just do CHECKPOINT; or such on the primary, it should succeed quickly?
Greetings,
Andres Freund
On Thu, 30 May 2019 at 20:13, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-05-30 19:46:26 +0530, Amit Khandekar wrote:
@@ -1042,7 +1042,8 @@ ReplicationSlotReserveWal(void)
 if (!RecoveryInProgress() && SlotIsLogical(slot))
 {
 ....
 }
 else
 {
- restart_lsn = GetRedoRecPtr();
+ restart_lsn = SlotIsLogical(slot) ?
+ GetXLogReplayRecPtr(&ThisTimeLineID) : GetRedoRecPtr();
But then when I do pg_create_logical_replication_slot(), it hangs in
DecodingContextFindStartpoint(), waiting to find new records
(XLogReadRecord).
But just till the primary has logged the necessary WAL records? If you
just do CHECKPOINT; or such on the primary, it should succeed quickly?
Yes, it waits until there is a commit record, or (just tried) until a
checkpoint command.
Greetings,
Andres Freund
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
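In the style of the TAP test above, the create-slot-then-checkpoint sequence being discussed could look roughly like this; a sketch only, with the slot name 'wait_slot' and the reuse of the node handles being assumptions:

# Start slot creation on the standby in the background; with restart_lsn
# taken from the replay position it blocks until the standby replays an
# xl_running_xacts record.
my ($out, $err);
my $h = IPC::Run::start(
	['pg_recvlogical', '-d', $node_replica->connstr('testdb'),
	 '-P', 'test_decoding', '-S', 'wait_slot', '--create-slot'],
	'>', \$out, '2>', \$err);

# A CHECKPOINT on the primary logs a standby snapshot (xl_running_xacts),
# which lets the standby reach a consistent decoding snapshot.
$node_master->safe_psql('postgres', 'CHECKPOINT');

# Slot creation on the standby should now finish promptly.
$h->finish;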
On Fri, 31 May 2019 at 11:08, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Thu, 30 May 2019 at 20:13, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-05-30 19:46:26 +0530, Amit Khandekar wrote:
@@ -1042,7 +1042,8 @@ ReplicationSlotReserveWal(void)
 if (!RecoveryInProgress() && SlotIsLogical(slot))
 {
 ....
 }
 else
 {
- restart_lsn = GetRedoRecPtr();
+ restart_lsn = SlotIsLogical(slot) ?
+ GetXLogReplayRecPtr(&ThisTimeLineID) : GetRedoRecPtr();
But then when I do pg_create_logical_replication_slot(), it hangs in
DecodingContextFindStartpoint(), waiting to find new records
(XLogReadRecord).
But just till the primary has logged the necessary WAL records? If you
just do CHECKPOINT; or such on the primary, it should succeed quickly?
Yes, it waits until there is a commit record, or (just tried) until a
checkpoint command.
Is XLOG_RUNNING_XACTS record essential for the logical decoding to
build a consistent snapshot ?
Since the restart_lsn is now ReplayRecPtr, there is no
XLOG_RUNNING_XACTS record, and so the snapshot state is not yet
SNAPBUILD_CONSISTENT. And so
DecodingContextFindStartpoint()=>DecodingContextReady() never returns
true, and hence DecodingContextFindStartpoint() goes in an infinite
loop, until it gets XLOG_RUNNING_XACTS.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Fri, 31 May 2019 at 17:31, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Fri, 31 May 2019 at 11:08, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Thu, 30 May 2019 at 20:13, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-05-30 19:46:26 +0530, Amit Khandekar wrote:
@@ -1042,7 +1042,8 @@ ReplicationSlotReserveWal(void)
 if (!RecoveryInProgress() && SlotIsLogical(slot))
 {
 ....
 }
 else
 {
- restart_lsn = GetRedoRecPtr();
+ restart_lsn = SlotIsLogical(slot) ?
+ GetXLogReplayRecPtr(&ThisTimeLineID) : GetRedoRecPtr();
But then when I do pg_create_logical_replication_slot(), it hangs in
DecodingContextFindStartpoint(), waiting to find new records
(XLogReadRecord).
But just till the primary has logged the necessary WAL records? If you
just do CHECKPOINT; or such on the primary, it should succeed quickly?
Yes, it waits until there is a commit record, or (just tried) until a
checkpoint command.
Is XLOG_RUNNING_XACTS record essential for the logical decoding to
build a consistent snapshot ?
Since the restart_lsn is now ReplayRecPtr, there is no
XLOG_RUNNING_XACTS record, and so the snapshot state is not yet
SNAPBUILD_CONSISTENT. And so
DecodingContextFindStartpoint()=>DecodingContextReady() never returns
true, and hence DecodingContextFindStartpoint() goes in an infinite
loop, until it gets XLOG_RUNNING_XACTS.
After giving more thought on this, I think it might make sense to
arrange for the xl_running_xact record to be sent from master to the
standby, when a logical slot is to be created on standby. How about
standby sending a new message type to the master, requesting for
xl_running_xact record ? Then on master, ProcessStandbyMessage() will
process this new message type and call LogStandbySnapshot().
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Hi,
On 2019-05-31 17:31:34 +0530, Amit Khandekar wrote:
On Fri, 31 May 2019 at 11:08, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Thu, 30 May 2019 at 20:13, Andres Freund <andres@anarazel.de> wrote:
Yes, it waits until there is a commit record, or (just tried) until a
checkpoint command.
That's fine with me.
Is XLOG_RUNNING_XACTS record essential for the logical decoding to
build a consistent snapshot ?
Yes.
Since the restart_lsn is now ReplayRecPtr, there is no
XLOG_RUNNING_XACTS record, and so the snapshot state is not yet
SNAPBUILD_CONSISTENT. And so
DecodingContextFindStartpoint()=>DecodingContextReady() never returns
true, and hence DecodingContextFindStartpoint() goes in an infinite
loop, until it gets XLOG_RUNNING_XACTS.
These seem like conflicting statements? Infinite loops don't terminate
until a record is logged?
Greetings,
Andres Freund
Hi,
On 2019-06-04 15:51:01 +0530, Amit Khandekar wrote:
After giving more thought on this, I think it might make sense to
arrange for the xl_running_xact record to be sent from master to the
standby, when a logical slot is to be created on standby. How about
standby sending a new message type to the master, requesting for
xl_running_xact record ? Then on master, ProcessStandbyMessage() will
process this new message type and call LogStandbySnapshot().
I think that should be a secondary feature. You don't necessarily know
the upstream master, as the setup could be cascading one. I think for
now just having to wait, perhaps with a comment to manually start a
checkpoint, ought to suffice?
Greetings,
Andres Freund
On Tue, 4 Jun 2019 at 21:28, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-06-04 15:51:01 +0530, Amit Khandekar wrote:
After giving more thought on this, I think it might make sense to
arrange for the xl_running_xact record to be sent from master to the
standby, when a logical slot is to be created on standby. How about
standby sending a new message type to the master, requesting for
xl_running_xact record ? Then on master, ProcessStandbyMessage() will
process this new message type and call LogStandbySnapshot().
I think that should be a secondary feature. You don't necessarily know
the upstream master, as the setup could be cascading one.
Oh yeah, cascading setup makes it more complicated.
I think for
now just having to wait, perhaps with a comment to manually start a
checkpoint, ought to suffice?
Ok.
Since this requires the test to handle the
fire-create-slot-and-then-fire-checkpoint-from-master actions, I was
modifying the test file to do this. After doing that, I found that the
slave gets an assertion failure in XLogReadRecord()=>XRecOffIsValid().
This happens only when the restart_lsn is set to ReplayRecPtr.
Somehow, this does not happen when I manually create the logical slot.
It happens only while running testcase. Working on it ...
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Mon, 10 Jun 2019 at 10:37, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Tue, 4 Jun 2019 at 21:28, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-06-04 15:51:01 +0530, Amit Khandekar wrote:
After giving more thought on this, I think it might make sense to
arrange for the xl_running_xact record to be sent from master to the
standby, when a logical slot is to be created on standby. How about
standby sending a new message type to the master, requesting for
xl_running_xact record ? Then on master, ProcessStandbyMessage() will
process this new message type and call LogStandbySnapshot().
I think that should be a secondary feature. You don't necessarily know
the upstream master, as the setup could be cascading one.
Oh yeah, cascading setup makes it more complicated.
I think for
now just having to wait, perhaps with a comment to manually start a
checkpoint, ought to suffice?
Ok.
Since this requires the test to handle the
fire-create-slot-and-then-fire-checkpoint-from-master actions, I was
modifying the test file to do this. After doing that, I found that the
slave gets an assertion failure in XLogReadRecord()=>XRecOffIsValid().
This happens only when the restart_lsn is set to ReplayRecPtr.
Somehow, this does not happen when I manually create the logical slot.
It happens only while running testcase. Working on it ...
Like I mentioned above, I get an assertion failure for
Assert(XRecOffIsValid(RecPtr)) while reading WAL records looking for a
start position (DecodingContextFindStartpoint()). This is because in
CreateInitDecodingContext()=>ReplicationSlotReserveWal(), I now set
the logical slot's restart_lsn to XLogCtl->lastReplayedEndRecPtr. And
just after bringing up slave, lastReplayedEndRecPtr's initial values
are in this order : 0/2000028, 0/2000060, 0/20000D8, 0/2000100,
0/3000000, 0/3000060. You can see that 0/3000000 is not a valid value
because it points to the start of a WAL block, meaning it points to
the XLog page header (I think it's possible because it is 1 + endof
last replayed record, which can be start of next block). So when we
try to create a slot when it's in that position, then XRecOffIsValid()
fails while looking for a starting point.
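As a quick arithmetic check of that claim (a throwaway sketch; 8192 is the default XLOG_BLCKSZ):

# 0/3000000 sits exactly on a WAL block boundary, i.e. on the page header,
# so it cannot be the start of a record and XRecOffIsValid() fails.
my $XLOG_BLCKSZ = 8192;          # default WAL block size
my $lsn         = 0x03000000;    # low 32 bits of 0/3000000
printf "offset within WAL block: %d\n", $lsn % $XLOG_BLCKSZ;    # prints 0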
One option I considered was : If lastReplayedEndRecPtr points to XLog
page header, get a position of the first record on that WAL block,
probably with XLogFindNextRecord(). But it is not trivial because
while in ReplicationSlotReserveWal(), XLogReaderState is not created
yet. Or else, do you think we can just increment the record pointer by
doing something like (lastReplayedEndRecPtr % XLOG_BLCKSZ) +
SizeOfXLogShortPHD() ?
Do you think that we can solve this using some other approach ? I am
not sure whether it's only the initial conditions that cause
lastReplayedEndRecPtr value to *not* point to a valid record, or is it
just a coincidence and that lastReplayedEndRecPtr can also have such a
value any time afterwards. If it's only possible initially, we can
just use GetRedoRecPtr() instead of lastReplayedEndRecPtr if
lastReplayedEndRecPtr is invalid.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On 2019-May-23, Andres Freund wrote:
On 2019-05-23 09:37:50 -0400, Robert Haas wrote:
On Thu, May 23, 2019 at 9:30 AM Sergei Kornilov <sk@zsrv.org> wrote:
wal_level is PGC_POSTMASTER.
But the primary can be restarted without restarting the standby. We require wal_level replica or higher (currently only logical) on the standby. So the wal_level in the standby's control file can change online from logical to replica.
That's true, but Amit's scenario involved a change in wal_level during
the execution of pg_create_logical_replication_slot(), which I think
can't happen.
I don't see why not - we're talking about the wal_level in the WAL
stream, not the setting on the standby. And that can change during the
execution of pg_create_logical_replication_slot(), if a PARAMTER_CHANGE
record is replayed. I don't think it's actually a problem, as I
outlined in my response to Amit, though.
I don't know if this is directly relevant, but in commit_ts.c we go to
great lengths to ensure that things continue to work across restarts and
changes of the GUC in the primary, by decoupling activation and
deactivation of the module from start-time initialization. Maybe that
idea is applicable for this too?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, 11 Jun 2019 at 12:24, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Mon, 10 Jun 2019 at 10:37, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Tue, 4 Jun 2019 at 21:28, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-06-04 15:51:01 +0530, Amit Khandekar wrote:
After giving more thought on this, I think it might make sense to
arrange for the xl_running_xact record to be sent from master to the
standby, when a logical slot is to be created on standby. How about
standby sending a new message type to the master, requesting for
xl_running_xact record ? Then on master, ProcessStandbyMessage() will
process this new message type and call LogStandbySnapshot().
I think that should be a secondary feature. You don't necessarily know
the upstream master, as the setup could be cascading one.
Oh yeah, cascading setup makes it more complicated.
I think for
now just having to wait, perhaps with a comment to manually start a
checkpoint, ought to suffice?
Ok.
Since this requires the test to handle the
fire-create-slot-and-then-fire-checkpoint-from-master actions, I was
modifying the test file to do this. After doing that, I found that the
slave gets an assertion failure in XLogReadRecord()=>XRecOffIsValid().
This happens only when the restart_lsn is set to ReplayRecPtr.
Somehow, this does not happen when I manually create the logical slot.
It happens only while running testcase. Working on it ...
Like I mentioned above, I get an assertion failure for
Assert(XRecOffIsValid(RecPtr)) while reading WAL records looking for a
start position (DecodingContextFindStartpoint()). This is because in
CreateInitDecodingContext()=>ReplicationSlotReserveWal(), I now set
the logical slot's restart_lsn to XLogCtl->lastReplayedEndRecPtr. And
just after bringing up slave, lastReplayedEndRecPtr's initial values
are in this order : 0/2000028, 0/2000060, 0/20000D8, 0/2000100,
0/3000000, 0/3000060. You can see that 0/3000000 is not a valid value
because it points to the start of a WAL block, meaning it points to
the XLog page header (I think it's possible because it is 1 + endof
last replayed record, which can be start of next block). So when we
try to create a slot when it's in that position, then XRecOffIsValid()
fails while looking for a starting point.
One option I considered was : If lastReplayedEndRecPtr points to XLog
page header, get a position of the first record on that WAL block,
probably with XLogFindNextRecord(). But it is not trivial because
while in ReplicationSlotReserveWal(), XLogReaderState is not created
yet.
In the attached v6 version of the patch, I did the above. That is, I
used XLogFindNextRecord() to bump up the restart_lsn of the slot to
the first valid record. But since XLogReaderState is not available in
ReplicationSlotReserveWal(), I did this in
DecodingContextFindStartpoint(). And then updated the slot restart_lsn
with this corrected position.
Since XLogFindNextRecord() is currently disabled using #if 0, removed
this directive.
Or else, do you think we can just increment the record pointer by
doing something like (lastReplayedEndRecPtr % XLOG_BLCKSZ) +
SizeOfXLogShortPHD() ?
I found out that we can't do this, because we don't know whether the
xlog header is SizeOfXLogShortPHD or SizeOfXLogLongPHD. In fact, in
our context, it is SizeOfXLogLongPHD. So we indeed need the
XLogReaderState handle.
Do you think that we can solve this using some other approach ? I am
not sure whether it's only the initial conditions that cause
lastReplayedEndRecPtr value to *not* point to a valid record, or is it
just a coincidence and that lastReplayedEndRecPtr can also have such a
value any time afterwards. If it's only possible initially, we can
just use GetRedoRecPtr() instead of lastReplayedEndRecPtr if
lastReplayedEndRecPtr is invalid.
So now as the v6 patch stands, lastReplayedEndRecPtr is used to set
the restart_lsn, but its position is later adjusted in
DecodingContextFindStartpoint().
Also, modified the test to handle the requirement that the logical
slot creation on standby requires a checkpoint (or any other
transaction commit) to be given from master. For that, in
src/test/perl/PostgresNode.pm, added a new function
create_logical_slot_on_standby() which does the required steps.
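The helper itself is only in the attachment, but presumably it wraps the start-pg_recvlogical / CHECKPOINT / finish sequence sketched earlier; a rough guess at its shape, with the signature and details being assumptions:

sub create_logical_slot_on_standby
{
	my ($master, $standby, $slot_name, $dbname) = @_;
	my ($stdout, $stderr);

	# Slot creation on a standby blocks until an xl_running_xacts record
	# has been replayed, so start it in the background ...
	my $handle = IPC::Run::start(
		['pg_recvlogical', '-d', $standby->connstr($dbname),
		 '-P', 'test_decoding', '-S', $slot_name, '--create-slot'],
		'>', \$stdout, '2>', \$stderr);

	# ... and get the master to log one; a CHECKPOINT does that as a side
	# effect.
	$master->safe_psql('postgres', 'CHECKPOINT');

	$handle->finish;
	die "slot creation on standby failed: $stderr" if $?;
	return;
}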
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
logical-decoding-on-standby_v6.patchapplication/octet-stream; name=logical-decoding-on-standby_v6.patchDownload
From 0ec74a3ab5e1d728223fab2c018f5b8a0612848b Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Wed, 12 Jun 2019 17:18:42 +0530
Subject: [PATCH] Logical decoding on standby - v6.
Author : Andres Freund.
Besides the above main changes, patch includes following :
1. Handle slot conflict recovery by dropping the conflicting slots.
-Amit Khandekar.
2. test/recovery/t/016_logical_decoding_on_replica.pl added.
Original author : Craig Ringer. few changes/additions from Amit Khandekar.
3. Handle slot conflicts when master wal_level becomes less than logical.
Changes in this v6 patch :
While creating the slot, lastReplayedEndRecPtr is used to set the
restart_lsn, but its position is later adjusted in
DecodingContextFindStartpoint() in case it does not point to a
valid record location. This can happen because replay pointer
points to 1 + end of last record replayed, which means it can
coincide with first byte of a new WAL block, i.e. inside block
header.
Also, modified the test to handle the requirement that the
logical slot creation on standby requires a checkpoint
(or any other transaction commit) to be given from master. For
that, in src/test/perl/PostgresNode.pm, added a new function
create_logical_slot_on_standby() which does the required steps.
-Amit Khandekar.
---
src/backend/access/gist/gistxlog.c | 6 +-
src/backend/access/hash/hash_xlog.c | 3 +-
src/backend/access/hash/hashinsert.c | 2 +
src/backend/access/heap/heapam.c | 23 +-
src/backend/access/heap/vacuumlazy.c | 2 +-
src/backend/access/heap/visibilitymap.c | 2 +-
src/backend/access/nbtree/nbtpage.c | 3 +
src/backend/access/nbtree/nbtxlog.c | 4 +-
src/backend/access/spgist/spgvacuum.c | 2 +
src/backend/access/spgist/spgxlog.c | 1 +
src/backend/access/transam/xlog.c | 20 ++
src/backend/access/transam/xlogreader.c | 4 -
src/backend/replication/logical/decode.c | 14 +-
src/backend/replication/logical/logical.c | 41 +++
src/backend/replication/slot.c | 131 ++++++-
src/backend/storage/ipc/standby.c | 7 +-
src/backend/utils/cache/lsyscache.c | 16 +
src/include/access/gistxlog.h | 3 +-
src/include/access/hash_xlog.h | 1 +
src/include/access/heapam_xlog.h | 8 +-
src/include/access/nbtxlog.h | 2 +
src/include/access/spgxlog.h | 1 +
src/include/access/xlog.h | 1 +
src/include/access/xlogreader.h | 2 -
src/include/replication/slot.h | 2 +
src/include/storage/standby.h | 2 +-
src/include/utils/lsyscache.h | 1 +
src/include/utils/rel.h | 1 +
src/test/perl/PostgresNode.pm | 27 ++
.../recovery/t/016_logical_decoding_on_replica.pl | 395 +++++++++++++++++++++
30 files changed, 683 insertions(+), 44 deletions(-)
create mode 100644 src/test/recovery/t/016_logical_decoding_on_replica.pl
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 503db34..385ea1f 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -195,7 +195,8 @@ gistRedoDeleteRecord(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
@@ -397,7 +398,7 @@ gistRedoPageReuse(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
- xlrec->node);
+ xlrec->onCatalogTable, xlrec->node);
}
}
@@ -589,6 +590,7 @@ gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXi
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index d7b7098..00c3e0f 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -1002,7 +1002,8 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
RelFileNode rnode;
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer);
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 5321762..e28465a 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
#include "access/hash.h"
#include "access/hash_xlog.h"
+#include "catalog/catalog.h"
#include "miscadmin.h"
#include "utils/rel.h"
#include "storage/lwlock.h"
@@ -398,6 +399,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
xl_hash_vacuum_one_page xlrec;
XLogRecPtr recptr;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(hrel);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.ntuples = ndeletable;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8ac0f8a..0791a4e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7108,12 +7108,13 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
* see comments for vacuum_log_cleanup_info().
*/
XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
+log_heap_cleanup_info(Relation rel, TransactionId latestRemovedXid)
{
xl_heap_cleanup_info xlrec;
XLogRecPtr recptr;
- xlrec.node = rnode;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
+ xlrec.node = rel->rd_node;
xlrec.latestRemovedXid = latestRemovedXid;
XLogBeginInsert();
@@ -7149,6 +7150,7 @@ log_heap_clean(Relation reln, Buffer buffer,
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
xlrec.ndead = ndead;
@@ -7199,6 +7201,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.cutoff_xid = cutoff_xid;
xlrec.ntuples = ntuples;
@@ -7229,7 +7232,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
* heap_buffer, if necessary.
*/
XLogRecPtr
-log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
+log_heap_visible(Relation rel, Buffer heap_buffer, Buffer vm_buffer,
TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
@@ -7239,6 +7242,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(heap_buffer));
Assert(BufferIsValid(vm_buffer));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec.cutoff_xid = cutoff_xid;
xlrec.flags = vmflags;
XLogBeginInsert();
@@ -7659,7 +7663,8 @@ heap_xlog_cleanup_info(XLogReaderState *record)
xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, xlrec->node);
/*
* Actual operation is a no-op. Record type exists to provide a means for
@@ -7695,7 +7700,8 @@ heap_xlog_clean(XLogReaderState *record)
* latestRemovedXid is invalid, skip conflict processing.
*/
if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
/*
* If we have a full-page image, restore it (using a cleanup lock) and
@@ -7791,7 +7797,9 @@ heap_xlog_visible(XLogReaderState *record)
* rather than killing the transaction outright.
*/
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid,
+ xlrec->onCatalogTable,
+ rnode);
/*
* Read the heap page, if it still exists. If the heap file has dropped or
@@ -7928,7 +7936,8 @@ heap_xlog_freeze_page(XLogReaderState *record)
TransactionIdRetreat(latestRemovedXid);
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a3c4a1d..bf34d3a 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -473,7 +473,7 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
* No need to write the record at all unless it contains a valid value
*/
if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
- (void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+ (void) log_heap_cleanup_info(rel, vacrelstats->latestRemovedXid);
}
/*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06..c5fdd64 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -281,7 +281,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
if (XLogRecPtrIsInvalid(recptr))
{
Assert(!InRecovery);
- recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
+ recptr = log_heap_visible(rel, heapBuf, vmBuf,
cutoff_xid, flags);
/*
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index de4d4ef..9b1231e 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -31,6 +31,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/lsyscache.h"
#include "utils/snapmgr.h"
static void _bt_cachemetadata(Relation rel, BTMetaPageData *input);
@@ -773,6 +774,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
@@ -1140,6 +1142,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
XLogRecPtr recptr;
xl_btree_delete xlrec_delete;
+ xlrec_delete.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_delete.latestRemovedXid = latestRemovedXid;
xlrec_delete.nitems = nitems;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 6532a25..b874bda 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -526,7 +526,8 @@ btree_xlog_delete(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
/*
@@ -810,6 +811,7 @@ btree_xlog_reuse_page(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable,
xlrec->node);
}
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 2b1662a..eaaf631 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -27,6 +27,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
+#include "utils/lsyscache.h"
/* Entry in pending-list of TIDs we need to revisit */
@@ -502,6 +503,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
OffsetNumber itemnos[MaxIndexTuplesPerPage];
spgxlogVacuumRedirect xlrec;
+ xlrec.onCatalogTable = get_rel_logical_catalog(index->rd_index->indrelid);
xlrec.nToPlaceholder = 0;
xlrec.newestRedirectXid = InvalidTransactionId;
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index ebe6ae8..800609c 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -881,6 +881,7 @@ spgRedoVacuumRedirect(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &node, NULL, NULL);
ResolveRecoveryConflictWithSnapshot(xldata->newestRedirectXid,
+ xldata->onCatalogTable,
node);
}
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e08320e..f092800 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4926,6 +4926,15 @@ LocalProcessControlFile(bool reset)
}
/*
+ * Get the wal_level from the control file.
+ */
+int
+ControlFileWalLevel(void)
+{
+ return ControlFile->wal_level;
+}
+
+/*
* Initialization of shared memory for XLOG
*/
Size
@@ -9843,6 +9852,17 @@ xlog_redo(XLogReaderState *record)
/* Update our copy of the parameters in pg_control */
memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
+ /*
+ * Drop logical slots if we are in hot standby and master does not have
+ * logical data. Don't bother to search for the slots if standby is
+ * running with wal_level lower than logical, because in that case,
+ * we would have disallowed creation of logical slots.
+ */
+ if (InRecovery && InHotStandby &&
+ xlrec.wal_level < WAL_LEVEL_LOGICAL &&
+ wal_level >= WAL_LEVEL_LOGICAL)
+ ResolveRecoveryConflictWithSlots(InvalidOid, InvalidTransactionId);
+
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->MaxConnections = xlrec.MaxConnections;
ControlFile->max_worker_processes = xlrec.max_worker_processes;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 88be7fe..431a302 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -878,7 +878,6 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
return true;
}
-#ifdef FRONTEND
/*
* Functions that are currently not needed in the backend, but are better
* implemented inside xlogreader.c because of the internal facilities available
@@ -1003,9 +1002,6 @@ out:
return found;
}
-#endif /* FRONTEND */
-
-
/* ----------------------------------------
* Functions for decoding the data and block references in a record.
* ----------------------------------------
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 151c3ef..c1bd028 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
* can restart from there.
*/
break;
+ case XLOG_PARAMETER_CHANGE:
+ {
+ xl_parameter_change *xlrec =
+ (xl_parameter_change *) XLogRecGetData(buf->record);
+
+ /* Cannot proceed if master itself does not have logical data */
+ if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ break;
+ }
case XLOG_NOOP:
case XLOG_NEXTOID:
case XLOG_SWITCH:
case XLOG_BACKUP_END:
- case XLOG_PARAMETER_CHANGE:
case XLOG_RESTORE_POINT:
case XLOG_FPW_CHANGE:
case XLOG_FPI_FOR_HINT:
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index bbd38c0..9f6e0ac 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -94,6 +94,24 @@ CheckLogicalDecodingRequirements(void)
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires a database connection")));
+ if (RecoveryInProgress())
+ {
+ /*
+ * This check may have race conditions, but whenever
+ * XLOG_PARAMETER_CHANGE indicates that wal_level has changed, we
+ * verify that there are no existing logical replication slots. And to
+ * avoid races around creating a new slot,
+ * CheckLogicalDecodingRequirements() is called once before creating
+ * the slot, and once when logical decoding is initially starting up.
+ */
+ if (ControlFileWalLevel() < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ }
+
+#ifdef NOT_ANYMORE
/* ----
* TODO: We got to change that someday soon...
*
@@ -111,6 +129,7 @@ CheckLogicalDecodingRequirements(void)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("logical decoding cannot be used while in recovery")));
+#endif
}
/*
@@ -241,6 +260,8 @@ CreateInitDecodingContext(char *plugin,
LogicalDecodingContext *ctx;
MemoryContext old_context;
+ CheckLogicalDecodingRequirements();
+
/* shorter lines... */
slot = MyReplicationSlot;
@@ -474,6 +495,26 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
(uint32) (slot->data.restart_lsn >> 32),
(uint32) slot->data.restart_lsn);
+ /*
+ * It is not guaranteed that the restart_lsn points to a valid
+ * record location. E.g. on standby, restart_lsn initially points to lastReplayedEndRecPtr,
+ * which is 1 + the end of the last replayed record, which means it can point to the next
+ * block header start. So bump it to the next valid record.
+ */
+ if (!XRecOffIsValid(startptr))
+ {
+ elog(DEBUG1, "Invalid restart lsn %X/%X",
+ (uint32) (startptr >> 32), (uint32) startptr);
+ startptr = XLogFindNextRecord(ctx->reader, startptr);
+
+ SpinLockAcquire(&slot->mutex);
+ slot->data.restart_lsn = startptr;
+ SpinLockRelease(&slot->mutex);
+
+ elog(DEBUG1, "Moved slot restart lsn to %X/%X",
+ (uint32) (startptr >> 32), (uint32) startptr);
+ }
+
/* Wait for a consistent starting point */
for (;;)
{
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 55c306e..7ffd264 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1016,37 +1016,37 @@ ReplicationSlotReserveWal(void)
/*
* For logical slots log a standby snapshot and start logical decoding
* at exactly that position. That allows the slot to start up more
- * quickly.
+ * quickly. But on a standby we cannot do WAL writes, so just use the
+ * replay pointer; effectively, an attempt to create a logical slot on
+ * standby will cause it to wait for an xl_running_xact record so that
+ * a snapshot can be built using the record.
*
- * That's not needed (or indeed helpful) for physical slots as they'll
- * start replay at the last logged checkpoint anyway. Instead return
- * the location of the last redo LSN. While that slightly increases
- * the chance that we have to retry, it's where a base backup has to
- * start replay at.
+ * None of this is needed (or indeed helpful) for physical slots as
+ * they'll start replay at the last logged checkpoint anyway. Instead
+ * return the location of the last redo LSN. While that slightly
+ * increases the chance that we have to retry, it's where a base backup
+ * has to start replay at.
*/
+
+ restart_lsn =
+ (SlotIsPhysical(slot) ? GetRedoRecPtr() :
+ (RecoveryInProgress() ? GetXLogReplayRecPtr(NULL) :
+ GetXLogInsertRecPtr()));
+
+ SpinLockAcquire(&slot->mutex);
+ slot->data.restart_lsn = restart_lsn;
+ SpinLockRelease(&slot->mutex);
+
if (!RecoveryInProgress() && SlotIsLogical(slot))
{
XLogRecPtr flushptr;
- /* start at current insert position */
- restart_lsn = GetXLogInsertRecPtr();
- SpinLockAcquire(&slot->mutex);
- slot->data.restart_lsn = restart_lsn;
- SpinLockRelease(&slot->mutex);
-
/* make sure we have enough information to start */
flushptr = LogStandbySnapshot();
/* and make sure it's fsynced to disk */
XLogFlush(flushptr);
}
- else
- {
- restart_lsn = GetRedoRecPtr();
- SpinLockAcquire(&slot->mutex);
- slot->data.restart_lsn = restart_lsn;
- SpinLockRelease(&slot->mutex);
- }
/* prevent WAL removal as fast as possible */
ReplicationSlotsComputeRequiredLSN();
@@ -1065,6 +1065,99 @@ ReplicationSlotReserveWal(void)
}
/*
+ * Resolve recovery conflicts with slots.
+ *
+ * When xid is valid, it means it's a removed-xid kind of conflict, so need to
+ * drop the appropriate slots whose xmin conflicts with removed xid.
+ * When xid is invalid, drop all logical slots. This is required when the
+ * master wal_level is set back to replica, so existing logical slots need to
+ * be dropped.
+ */
+void
+ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
+{
+ int i;
+ bool found_conflict = false;
+
+ if (max_replication_slots <= 0)
+ return;
+
+restart:
+ if (found_conflict)
+ {
+ CHECK_FOR_INTERRUPTS();
+ /*
+ * Wait awhile for them to die so that we avoid flooding an
+ * unresponsive backend when system is heavily loaded.
+ */
+ pg_usleep(100000);
+ found_conflict = false;
+ }
+
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationSlot *s;
+ NameData slotname;
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;
+
+ s = &ReplicationSlotCtl->replication_slots[i];
+
+ /* cannot change while ReplicationSlotControlLock is held */
+ if (!s->in_use)
+ continue;
+
+ /* Invalid xid means caller is asking to drop all logical slots */
+ if (!TransactionIdIsValid(xid) && SlotIsLogical(s))
+ found_conflict = true;
+ else
+ {
+ /* not our database, skip */
+ if (s->data.database != InvalidOid && s->data.database != dboid)
+ continue;
+
+ SpinLockAcquire(&s->mutex);
+ slotname = s->data.name;
+ slot_xmin = s->data.xmin;
+ slot_catalog_xmin = s->data.catalog_xmin;
+ SpinLockRelease(&s->mutex);
+
+ if (TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(LOG,
+ (errmsg("slot %s w/ xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_xmin, xid)));
+ }
+
+ if (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(LOG,
+ (errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_catalog_xmin, xid)));
+ }
+
+ }
+ if (found_conflict)
+ {
+ elog(LOG, "Dropping conflicting slot %s", s->data.name.data);
+ LWLockRelease(ReplicationSlotControlLock); /* avoid deadlock */
+ ReplicationSlotDropPtr(s);
+
+ /* We released the lock above; so re-scan the slots. */
+ goto restart;
+ }
+ }
+
+ LWLockRelease(ReplicationSlotControlLock);
+}
+
+
+/*
* Flush all replication slots to disk.
*
* This needn't actually be part of a checkpoint, but it's a convenient
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 25b7e31..93c4439 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -23,6 +23,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/slot.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -291,7 +292,8 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
}
void
-ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode node)
+ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
+ bool onCatalogTable, RelFileNode node)
{
VirtualTransactionId *backends;
@@ -312,6 +314,9 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode
ResolveRecoveryConflictWithVirtualXIDs(backends,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+
+ if (onCatalogTable)
+ ResolveRecoveryConflictWithSlots(node.dbNode, latestRemovedXid);
}
void
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index b4f2d0f..f4da4bc 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -18,7 +18,9 @@
#include "access/hash.h"
#include "access/htup_details.h"
#include "access/nbtree.h"
+#include "access/table.h"
#include "bootstrap/bootstrap.h"
+#include "catalog/catalog.h"
#include "catalog/namespace.h"
#include "catalog/pg_am.h"
#include "catalog/pg_amop.h"
@@ -1893,6 +1895,20 @@ get_rel_persistence(Oid relid)
return result;
}
+bool
+get_rel_logical_catalog(Oid relid)
+{
+ bool res;
+ Relation rel;
+
+ /* assume previously locked */
+ rel = heap_open(relid, NoLock);
+ res = RelationIsAccessibleInLogicalDecoding(rel);
+ heap_close(rel, NoLock);
+
+ return res;
+}
+
/* ---------- TRANSFORM CACHE ---------- */
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 969a537..59246c3 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -48,9 +48,9 @@ typedef struct gistxlogPageUpdate
*/
typedef struct gistxlogDelete
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 ntodelete; /* number of deleted offsets */
-
/*
* In payload of blk 0 : todelete OffsetNumbers
*/
@@ -96,6 +96,7 @@ typedef struct gistxlogPageDelete
*/
typedef struct gistxlogPageReuse
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 53b682c..fd70b55 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -263,6 +263,7 @@ typedef struct xl_hash_init_bitmap_page
*/
typedef struct xl_hash_vacuum_one_page
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int ntuples;
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index f6cdca8..a1d1f11 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -237,6 +237,7 @@ typedef struct xl_heap_update
*/
typedef struct xl_heap_clean
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 nredirected;
uint16 ndead;
@@ -252,6 +253,7 @@ typedef struct xl_heap_clean
*/
typedef struct xl_heap_cleanup_info
{
+ bool onCatalogTable;
RelFileNode node;
TransactionId latestRemovedXid;
} xl_heap_cleanup_info;
@@ -332,6 +334,7 @@ typedef struct xl_heap_freeze_tuple
*/
typedef struct xl_heap_freeze_page
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint16 ntuples;
} xl_heap_freeze_page;
@@ -346,6 +349,7 @@ typedef struct xl_heap_freeze_page
*/
typedef struct xl_heap_visible
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint8 flags;
} xl_heap_visible;
@@ -395,7 +399,7 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
extern const char *heap2_identify(uint8 info);
extern void heap_xlog_logical_rewrite(XLogReaderState *r);
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
+extern XLogRecPtr log_heap_cleanup_info(Relation rel,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
@@ -414,7 +418,7 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
bool *totally_frozen);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
-extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
+extern XLogRecPtr log_heap_visible(Relation rel, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc8..f64a33c 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -126,6 +126,7 @@ typedef struct xl_btree_split
*/
typedef struct xl_btree_delete
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int nitems;
@@ -139,6 +140,7 @@ typedef struct xl_btree_delete
*/
typedef struct xl_btree_reuse_page
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/spgxlog.h b/src/include/access/spgxlog.h
index 073f740..d3dad69 100644
--- a/src/include/access/spgxlog.h
+++ b/src/include/access/spgxlog.h
@@ -237,6 +237,7 @@ typedef struct spgxlogVacuumRoot
typedef struct spgxlogVacuumRedirect
{
+ bool onCatalogTable;
uint16 nToPlaceholder; /* number of redirects to make placeholders */
OffsetNumber firstPlaceholder; /* first placeholder tuple to remove */
TransactionId newestRedirectXid; /* newest XID of removed redirects */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 237f4e0..fa02728 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -299,6 +299,7 @@ extern Size XLOGShmemSize(void);
extern void XLOGShmemInit(void);
extern void BootStrapXLOG(void);
extern void LocalProcessControlFile(bool reset);
+extern int ControlFileWalLevel(void);
extern void StartupXLOG(void);
extern void ShutdownXLOG(int code, Datum arg);
extern void InitXLOGAccess(void);
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 04228e2..a5ffffc 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -215,9 +215,7 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
/* Invalidate read state */
extern void XLogReaderInvalReadState(XLogReaderState *state);
-#ifdef FRONTEND
extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
-#endif /* FRONTEND */
/* Functions for decoding an XLogRecord */
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 8bc7f52..522153a 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -205,4 +205,6 @@ extern void CheckPointReplicationSlots(void);
extern void CheckSlotRequirements(void);
+extern void ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid);
+
#endif /* SLOT_H */
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index a3f8f82..6dedebc 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -28,7 +28,7 @@ extern void InitRecoveryTransactionEnvironment(void);
extern void ShutdownRecoveryTransactionEnvironment(void);
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
- RelFileNode node);
+ bool onCatalogTable, RelFileNode node);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index c8df5bf..579d9ff 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -131,6 +131,7 @@ extern char get_rel_relkind(Oid relid);
extern bool get_rel_relispartition(Oid relid);
extern Oid get_rel_tablespace(Oid relid);
extern char get_rel_persistence(Oid relid);
+extern bool get_rel_logical_catalog(Oid relid);
extern Oid get_transform_fromsql(Oid typid, Oid langid, List *trftypes);
extern Oid get_transform_tosql(Oid typid, Oid langid, List *trftypes);
extern bool get_typisdefined(Oid typid);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d7f33ab..8c90fd7 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -16,6 +16,7 @@
#include "access/tupdesc.h"
#include "access/xlog.h"
+#include "catalog/catalog.h"
#include "catalog/pg_class.h"
#include "catalog/pg_index.h"
#include "catalog/pg_publication.h"
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 8d5ad6b..a9a1ac7 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -2009,6 +2009,33 @@ sub pg_recvlogical_upto
=pod
+=item $node->create_logical_slot_on_standby(self, master, slot_name, dbname)
+
+Create logical replication slot on given standby
+
+=cut
+
+sub create_logical_slot_on_standby
+{
+ my ($self, $master, $slot_name, $dbname) = @_;
+ my ($stdout, $stderr);
+
+ my $handle;
+
+ $handle = IPC::Run::start(['pg_recvlogical', '-d', $self->connstr($dbname), '-P', 'test_decoding', '-S', $slot_name, '--create-slot'], '>', \$stdout, '2>', \$stderr);
+ sleep(1);
+
+ # Slot creation on standby waits for an xl_running_xacts record. So arrange
+ # for it.
+ $master->safe_psql('postgres', 'CHECKPOINT');
+
+ $handle->finish();
+
+ return 0;
+}
+
+=pod
+
=back
=cut
diff --git a/src/test/recovery/t/016_logical_decoding_on_replica.pl b/src/test/recovery/t/016_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..304f32a
--- /dev/null
+++ b/src/test/recovery/t/016_logical_decoding_on_replica.pl
@@ -0,0 +1,395 @@
+# Test logical decoding on a standby.
+#
+# Also covers recovery conflicts that drop conflicting slots on the standby.
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 51;
+use RecursiveCopy;
+use File::Copy;
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--slot=decoding_standby');
+
+# Fetch xmin columns from slot's pg_replication_slots row, after waiting for
+# given boolean condition to be true to ensure we've reached a quiescent state
+sub wait_for_phys_mins
+{
+ my ($node, $slotname, $check_expr) = @_;
+
+ $node->poll_query_until(
+ 'postgres', qq[
+ SELECT $check_expr
+ FROM pg_catalog.pg_replication_slots
+ WHERE slot_name = '$slotname';
+ ]) or die "Timed out waiting for slot xmins to advance";
+
+ my $slotinfo = $node->slot($slotname);
+ return ($slotinfo->{'xmin'}, $slotinfo->{'catalog_xmin'});
+}
+
+sub print_phys_xmin
+{
+ my $slot = $node_master->slot('decoding_standby');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# After slot creation, xmins must be null
+is($xmin, '', "xmin null");
+is($catalog_xmin, '', "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+ $node_master, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_replica->append_conf('postgresql.conf',
+ q[primary_slot_name = 'decoding_standby']);
+
+$node_replica->start;
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# with hot_standby_feedback off, xmin and catalog_xmin must still be null
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "xmin null after replica join");
+is($catalog_xmin, '', "catalog_xmin null after replica join");
+
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. With hot_standby_feedback on, xmin should advance,
+# but catalog_xmin should still remain NULL since there is no logical slot.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NULL");
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+#
+# This must succeed since we know we have a catalog_xmin reservation. We
+# might've already sent hot standby feedback to advance our physical slot's
+# catalog_xmin but not received the corresponding xlog for the catalog xmin
+# advance, in which case we'll create a slot that isn't usable. The calling
+# application can prevent this by creating a temporary slot on the master to
+# lock in its catalog_xmin. For a truly race-free solution we'd need
+# master-to-standby hot_standby_feedback replies.
+#
+# In this case it won't race because there's no concurrent activity on the
+# master.
+#
+is($node_replica->create_logical_slot_on_standby($node_master, 'standby_logical', 'testdb'),
+ 0, 'logical slot creation on standby succeeded')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+sub print_logical_xmin
+{
+ my $slot = $node_replica->slot('standby_logical');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
+$node_master->safe_psql('testdb', 'DROP TABLE test_table');
+$node_master->safe_psql('testdb', 'VACUUM');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+# Should show the inserts even when the table is dropped on master
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($stderr, '', 'stderr is empty');
+is($ret, 0, 'replay from slot succeeded')
+ or BAIL_OUT('cannot continue if slot replay fails');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_1();]);
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+for my $i (0 .. 2000)
+{
+ $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_2();]);
+$node_master->safe_psql('testdb', 'VACUUM');
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+cmp_ok($new_logical_catalog_xmin, "==", $logical_catalog_xmin, "logical slot catalog_xmin hasn't advanced before get_changes");
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+isnt($stdout, '', 'replayed some rows');
+
+($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+is($new_logical_xmin, '', "logical xmin null");
+isnt($new_logical_catalog_xmin, '', "logical slot catalog_xmin not null");
+cmp_ok($new_logical_catalog_xmin, ">", $logical_catalog_xmin, "logical slot catalog_xmin advanced after get_changes");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+isnt($new_physical_xmin, '', "physical xmin not null");
+# hot standby feedback should advance phys catalog_xmin now that the standby's
+# slot doesn't hold it down as far.
+isnt($new_physical_catalog_xmin, '', "physical catalog_xmin not null");
+cmp_ok($new_physical_catalog_xmin, ">", $physical_catalog_xmin, "physical catalog_xmin advanced");
+
+cmp_ok($new_physical_catalog_xmin, "<=", $new_logical_catalog_xmin, 'upstream physical slot catalog_xmin not past downstream catalog_xmin with hs_feedback on');
+
+#########################################################
+# Upstream oldestXid retention
+#########################################################
+
+sub test_oldest_xid_retention()
+{
+ # First burn some xids on the master in another DB, so we push the master's
+ # nextXid ahead.
+ foreach my $i (1 .. 100)
+ {
+ $node_master->safe_psql('postgres', 'SELECT txid_current()');
+ }
+
+ # Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
+ # past our needed xmin. The only way we have visibility into that is to force
+ # a checkpoint.
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
+ foreach my $dbname ('template1', 'postgres', 'testdb', 'template0')
+ {
+ $node_master->safe_psql($dbname, 'VACUUM FREEZE');
+ }
+ sleep(1);
+ $node_master->safe_psql('postgres', 'CHECKPOINT');
+ IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
+ or die "pg_controldata failed with $?";
+ my @checkpoint = split('\n', $stdout);
+ my ($oldestXid, $nextXid) = ('', '');
+ foreach my $line (@checkpoint)
+ {
+ if ($line =~ qr/^Latest checkpoint's NextXID:\s+\d+:(\d+)/)
+ {
+ $nextXid = $1;
+ }
+ if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
+ {
+ $oldestXid = $1;
+ }
+ }
+ die 'no oldestXID found in checkpoint' unless $oldestXid;
+
+ my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+ my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+
+ print "upstream oldestXid $oldestXid, nextXid $nextXid, phys slot catalog_xmin $new_physical_catalog_xmin, downstream catalog_xmin $new_logical_catalog_xmin";
+
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
+
+ return ($oldestXid);
+}
+
+my ($oldestXid) = test_oldest_xid_retention();
+
+cmp_ok($oldestXid, "<=", $new_logical_catalog_xmin, 'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
+
+########################################################################
+# Recovery conflict: conflicting replication slot should get dropped
+########################################################################
+
+# One way to reproduce recovery conflict is to run VACUUM FULL with
+# hot_standby_feedback turned off on slave.
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = off
+]);
+$node_replica->restart;
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. Both should be NULL since hs_feedback is off
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NULL AND catalog_xmin IS NULL");
+$node_master->safe_psql('testdb', 'VACUUM FULL');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+isnt($ret, 0, 'usage of slot failed as expected');
+like($stderr, qr/does not exist/, 'slot not found as expected');
+
+# Re-create the slot now that we know it is dropped
+is($node_replica->create_logical_slot_on_standby($node_master, 'standby_logical', 'testdb'),
+ 0, 'logical slot creation on standby succeeded')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+# Set hot_standby_feedback back on
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. Both should be non-NULL since hs_feedback is on and
+# there is a logical slot present on standby.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NOT NULL");
+
+##################################################
+# Drop slot
+##################################################
+#
+is($node_replica->safe_psql('postgres', 'SHOW hot_standby_feedback'), 'on', 'hs_feedback is on');
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+
+# Make sure slots on replicas are droppable, and properly clear the upstream's xmin
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
+
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. catalog_xmin should become NULL because we dropped
+# the logical slot.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NULL");
+
+##################################################
+# Recovery: drop database drops idle slots
+##################################################
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB on the upstream if they're on the right DB, or not dropped if on
+# another DB.
+
+is($node_replica->create_logical_slot_on_standby($node_master, 'dodropslot', 'testdb'),
+ 0, 'created dodropslot on testdb')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+is($node_replica->create_logical_slot_on_standby($node_master, 'otherslot', 'postgres'),
+ 0, 'created otherslot on postgres')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, 'logical', 'slot dodropslot on standby created');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'slot otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+ 'database dropped on standby');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, '', 'slot dodropslot on standby dropped');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby not dropped');
+
+
+##################################################
+# Recovery: drop database drops in-use slots
+##################################################
+
+# This time, have the slot in-use on the downstream DB when we drop it.
+print "Testing dropdb when downstream slot is in-use";
+$node_master->psql('postgres', q[CREATE DATABASE testdb2]);
+
+print "creating slot dodropslot2";
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-P', 'test_decoding', '-S', 'dodropslot2', '--create-slot'],
+ 'pg_recvlogical created slot dodropslot2');
+is($node_replica->slot('dodropslot2')->{'slot_type'}, 'logical', 'slot dodropslot2 on standby created');
+
+# make sure the slot is in use
+print "starting pg_recvlogical";
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+sleep(1);
+
+is($node_replica->slot('dodropslot2')->{'active'}, 't', 'slot on standby is active')
+ or BAIL_OUT("slot not active on standby, cannot continue. pg_recvlogical exited with '$stdout', '$stderr'");
+
+# Master doesn't know the replica's slot is busy so dropdb should succeed
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb2]);
+ok(1, 'dropdb finished');
+
+while ($node_replica->slot('dodropslot2')->{'active_pid'})
+{
+ sleep(1);
+ print "waiting for walsender to exit";
+}
+
+print "walsender exited, waiting for pg_recvlogical to exit";
+
+# our client should've terminated in response to the walsender error
+eval {
+ $handle->finish;
+};
+$return = $?;
+if ($return) {
+ is($return, 256, "pg_recvlogical terminated by server");
+ like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict');
+ like($stderr, qr/User was connected to a database that must be dropped./, 'recvlogical recovery conflict db');
+}
+
+is($node_replica->slot('dodropslot2')->{'active_pid'}, '', 'walsender backend exited');
+
+# The slot should be dropped by recovery now
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb2')]), 'f',
+ 'database dropped on standby');
+
+is($node_replica->slot('dodropslot2')->{'slot_type'}, '', 'slot on standby dropped');
--
2.1.4
On Wed, 22 May 2019 at 15:05, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Tue, 9 Apr 2019 at 22:23, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Sat, 6 Apr 2019 at 04:45, Andres Freund <andres@anarazel.de> wrote:
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 006446b..5785d2f 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1064,6 +1064,85 @@ ReplicationSlotReserveWal(void)
}
}
+void
+ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
+{
+ int i;
+ bool found_conflict = false;
+
+ if (max_replication_slots <= 0)
+ return;
+
+restart:
+ if (found_conflict)
+ {
+ CHECK_FOR_INTERRUPTS();
+ /*
+ * Wait awhile for them to die so that we avoid flooding an
+ * unresponsive backend when system is heavily loaded.
+ */
+ pg_usleep(100000);
+ found_conflict = false;
+ }
+
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationSlot *s;
+ NameData slotname;
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;
+
+ s = &ReplicationSlotCtl->replication_slots[i];
+
+ /* cannot change while ReplicationSlotCtlLock is held */
+ if (!s->in_use)
+ continue;
+
+ /* not our database, skip */
+ if (s->data.database != InvalidOid && s->data.database != dboid)
+ continue;
+
+ SpinLockAcquire(&s->mutex);
+ slotname = s->data.name;
+ slot_xmin = s->data.xmin;
+ slot_catalog_xmin = s->data.catalog_xmin;
+ SpinLockRelease(&s->mutex);
+
+ if (TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(WARNING,
+ (errmsg("slot %s w/ xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_xmin, xid)));
+ }
+
+ if (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(WARNING,
+ (errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_catalog_xmin, xid)));
+ }
+
+
+ if (found_conflict)
+ {
+ elog(WARNING, "Dropping conflicting slot %s", s->data.name.data);
+ LWLockRelease(ReplicationSlotControlLock); /* avoid deadlock */
+ ReplicationSlotDropPtr(s);
+
+ /* We released the lock above; so re-scan the slots. */
+ goto restart;
+ }
+ }

I think this should be refactored so that the two found_conflict cases
set a 'reason' variable (perhaps an enum?) to the particular reason, and
then only one warning should be emitted. I also think that LOG might be
more appropriate than WARNING - as confusing as that is, LOG is more
severe than WARNING (see docs about log_min_messages).

What I have in mind is :
ereport(LOG,
(errcode(ERRCODE_INTERNAL_ERROR),
errmsg("Dropping conflicting slot %s", s->data.name.data),
errdetail("%s, removed xid %d.", conflict_str, xid)));
where conflict_str is a dynamically generated string containing
something like : "slot xmin : 1234, slot catalog_xmin: 5678"
So for the user, the errdetail will look like :
"slot xmin: 1234, catalog_xmin: 5678, removed xid : 9012"
I think the user can figure out whether it was xmin or catalog_xmin or
both that conflicted with removed xid.
If we don't do this way, we may not be able to show in a single
message if both xmin and catalog_xmin are conflicting at the same
time.

Does this message look good to you, or did you have in mind something quite
different?

The above one is yet another point that needs to be concluded on. Till
then I will use the above way to display the error message in the
upcoming patch version.
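
For illustration, a condensed sketch of that single-message approach could
look like the following (the helper name report_conflicting_slot and the
exact wording are illustrative only, not part of the patch; the actual
implementation is in the attached v7 patch):

/*
 * Illustrative sketch only: collect whichever of xmin/catalog_xmin
 * conflicted into one detail string, then emit a single LOG message.
 * Assumes the usual backend headers (postgres.h, lib/stringinfo.h).
 */
static void
report_conflicting_slot(NameData slotname, TransactionId slot_xmin,
                        TransactionId slot_catalog_xmin, TransactionId xid)
{
    StringInfoData detail;

    initStringInfo(&detail);

    if (TransactionIdIsValid(slot_xmin) &&
        TransactionIdPrecedesOrEquals(slot_xmin, xid))
        appendStringInfo(&detail, "slot xmin: %u", slot_xmin);

    if (TransactionIdIsValid(slot_catalog_xmin) &&
        TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
        appendStringInfo(&detail, "%sslot catalog_xmin: %u",
                         detail.len > 0 ? ", " : "", slot_catalog_xmin);

    ereport(LOG,
            (errmsg("dropping conflicting slot \"%s\"", NameStr(slotname)),
             errdetail("%s, removed xid: %u.", detail.data, xid)));

    pfree(detail.data);
}

The caller is still responsible for skipping slots that do not conflict;
the point is only that both horizons end up in one errdetail instead of
two separate warnings.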
Attached is v7 version that has the above changes regarding having a
single error message.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
logical-decoding-on-standby_v7.patch (application/octet-stream)
From 183355d4128f34488aef5b20ba4612d3fcbe358e Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Fri, 14 Jun 2019 16:46:41 +0530
Subject: [PATCH] Logical decoding on standby - v7.
Author : Andres Freund.
Besides the above main changes, the patch includes the following:
1. Handle slot conflict recovery by dropping the conflicting slots.
-Amit Khandekar.
2. test/recovery/t/016_logical_decoding_on_replica.pl added.
Original author: Craig Ringer, with a few changes/additions from Amit Khandekar.
3. Handle slot conflicts when master wal_level becomes less than logical.
Changes in v6 patch :
While creating the slot, lastReplayedEndRecPtr is used to set the
restart_lsn, but its position is later adjusted in
DecodingContextFindStartpoint() in case it does not point to a
valid record location. This can happen because the replay pointer
points to 1 + the end of the last replayed record, which means it
can coincide with the first byte of a new WAL block, i.e. fall
inside the block header.
Also, modified the test to handle the requirement that
logical slot creation on a standby needs a checkpoint
(or any other transaction commit) to be issued on the master. For
that, added a new function create_logical_slot_on_standby() to
src/test/perl/PostgresNode.pm, which performs the required steps.
Changes in v7 patch :
Merge the two conflict messages for xmin and catalog_xmin into
a single one.
---
src/backend/access/gist/gistxlog.c | 6 +-
src/backend/access/hash/hash_xlog.c | 3 +-
src/backend/access/hash/hashinsert.c | 2 +
src/backend/access/heap/heapam.c | 23 +-
src/backend/access/heap/vacuumlazy.c | 2 +-
src/backend/access/heap/visibilitymap.c | 2 +-
src/backend/access/nbtree/nbtpage.c | 3 +
src/backend/access/nbtree/nbtxlog.c | 4 +-
src/backend/access/spgist/spgvacuum.c | 2 +
src/backend/access/spgist/spgxlog.c | 1 +
src/backend/access/transam/xlog.c | 21 ++
src/backend/access/transam/xlogreader.c | 4 -
src/backend/replication/logical/decode.c | 14 +-
src/backend/replication/logical/logical.c | 41 +++
src/backend/replication/slot.c | 146 +++++++-
src/backend/storage/ipc/standby.c | 7 +-
src/backend/utils/cache/lsyscache.c | 16 +
src/include/access/gistxlog.h | 3 +-
src/include/access/hash_xlog.h | 1 +
src/include/access/heapam_xlog.h | 8 +-
src/include/access/nbtxlog.h | 2 +
src/include/access/spgxlog.h | 1 +
src/include/access/xlog.h | 1 +
src/include/access/xlogreader.h | 2 -
src/include/replication/slot.h | 2 +
src/include/storage/standby.h | 2 +-
src/include/utils/lsyscache.h | 1 +
src/include/utils/rel.h | 1 +
src/test/perl/PostgresNode.pm | 27 ++
.../recovery/t/018_logical_decoding_on_replica.pl | 395 +++++++++++++++++++++
30 files changed, 699 insertions(+), 44 deletions(-)
create mode 100644 src/test/recovery/t/018_logical_decoding_on_replica.pl
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 503db34..385ea1f 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -195,7 +195,8 @@ gistRedoDeleteRecord(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
@@ -397,7 +398,7 @@ gistRedoPageReuse(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
- xlrec->node);
+ xlrec->onCatalogTable, xlrec->node);
}
}
@@ -589,6 +590,7 @@ gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXi
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index d7b7098..00c3e0f 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -1002,7 +1002,8 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
RelFileNode rnode;
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer);
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 5321762..e28465a 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
#include "access/hash.h"
#include "access/hash_xlog.h"
+#include "catalog/catalog.h"
#include "miscadmin.h"
#include "utils/rel.h"
#include "storage/lwlock.h"
@@ -398,6 +399,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
xl_hash_vacuum_one_page xlrec;
XLogRecPtr recptr;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(hrel);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.ntuples = ndeletable;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a775760..58ec991 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7150,12 +7150,13 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
* see comments for vacuum_log_cleanup_info().
*/
XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
+log_heap_cleanup_info(Relation rel, TransactionId latestRemovedXid)
{
xl_heap_cleanup_info xlrec;
XLogRecPtr recptr;
- xlrec.node = rnode;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
+ xlrec.node = rel->rd_node;
xlrec.latestRemovedXid = latestRemovedXid;
XLogBeginInsert();
@@ -7191,6 +7192,7 @@ log_heap_clean(Relation reln, Buffer buffer,
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
xlrec.ndead = ndead;
@@ -7241,6 +7243,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.cutoff_xid = cutoff_xid;
xlrec.ntuples = ntuples;
@@ -7271,7 +7274,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
* heap_buffer, if necessary.
*/
XLogRecPtr
-log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
+log_heap_visible(Relation rel, Buffer heap_buffer, Buffer vm_buffer,
TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
@@ -7281,6 +7284,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(heap_buffer));
Assert(BufferIsValid(vm_buffer));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec.cutoff_xid = cutoff_xid;
xlrec.flags = vmflags;
XLogBeginInsert();
@@ -7701,7 +7705,8 @@ heap_xlog_cleanup_info(XLogReaderState *record)
xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, xlrec->node);
/*
* Actual operation is a no-op. Record type exists to provide a means for
@@ -7737,7 +7742,8 @@ heap_xlog_clean(XLogReaderState *record)
* latestRemovedXid is invalid, skip conflict processing.
*/
if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
/*
* If we have a full-page image, restore it (using a cleanup lock) and
@@ -7833,7 +7839,9 @@ heap_xlog_visible(XLogReaderState *record)
* rather than killing the transaction outright.
*/
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid,
+ xlrec->onCatalogTable,
+ rnode);
/*
* Read the heap page, if it still exists. If the heap file has dropped or
@@ -7970,7 +7978,8 @@ heap_xlog_freeze_page(XLogReaderState *record)
TransactionIdRetreat(latestRemovedXid);
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a3c4a1d..bf34d3a 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -473,7 +473,7 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
* No need to write the record at all unless it contains a valid value
*/
if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
- (void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+ (void) log_heap_cleanup_info(rel, vacrelstats->latestRemovedXid);
}
/*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06..c5fdd64 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -281,7 +281,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
if (XLogRecPtrIsInvalid(recptr))
{
Assert(!InRecovery);
- recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
+ recptr = log_heap_visible(rel, heapBuf, vmBuf,
cutoff_xid, flags);
/*
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index de4d4ef..9b1231e 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -31,6 +31,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/lsyscache.h"
#include "utils/snapmgr.h"
static void _bt_cachemetadata(Relation rel, BTMetaPageData *input);
@@ -773,6 +774,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
@@ -1140,6 +1142,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
XLogRecPtr recptr;
xl_btree_delete xlrec_delete;
+ xlrec_delete.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_delete.latestRemovedXid = latestRemovedXid;
xlrec_delete.nitems = nitems;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 6532a25..b874bda 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -526,7 +526,8 @@ btree_xlog_delete(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
/*
@@ -810,6 +811,7 @@ btree_xlog_reuse_page(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable,
xlrec->node);
}
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 2b1662a..eaaf631 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -27,6 +27,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
+#include "utils/lsyscache.h"
/* Entry in pending-list of TIDs we need to revisit */
@@ -502,6 +503,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
OffsetNumber itemnos[MaxIndexTuplesPerPage];
spgxlogVacuumRedirect xlrec;
+ xlrec.onCatalogTable = get_rel_logical_catalog(index->rd_index->indrelid);
xlrec.nToPlaceholder = 0;
xlrec.newestRedirectXid = InvalidTransactionId;
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index ebe6ae8..800609c 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -881,6 +881,7 @@ spgRedoVacuumRedirect(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &node, NULL, NULL);
ResolveRecoveryConflictWithSnapshot(xldata->newestRedirectXid,
+ xldata->onCatalogTable,
node);
}
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e08320e..78d3ad1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4926,6 +4926,15 @@ LocalProcessControlFile(bool reset)
}
/*
+ * Get the wal_level from the control file.
+ */
+int
+ControlFileWalLevel(void)
+{
+ return ControlFile->wal_level;
+}
+
+/*
* Initialization of shared memory for XLOG
*/
Size
@@ -9843,6 +9852,18 @@ xlog_redo(XLogReaderState *record)
/* Update our copy of the parameters in pg_control */
memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
+ /*
+ * Drop logical slots if we are in hot standby and master does not have
+ * logical data. Don't bother to search for the slots if standby is
+ * running with wal_level lower than logical, because in that case,
+ * we would have disallowed creation of logical slots.
+ */
+ if (InRecovery && InHotStandby &&
+ xlrec.wal_level < WAL_LEVEL_LOGICAL &&
+ wal_level >= WAL_LEVEL_LOGICAL)
+ ResolveRecoveryConflictWithSlots(InvalidOid, InvalidTransactionId,
+ gettext_noop("logical decoding on standby requires wal_level >= logical on master"));
+
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->MaxConnections = xlrec.MaxConnections;
ControlFile->max_worker_processes = xlrec.max_worker_processes;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 88be7fe..431a302 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -878,7 +878,6 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
return true;
}
-#ifdef FRONTEND
/*
* Functions that are currently not needed in the backend, but are better
* implemented inside xlogreader.c because of the internal facilities available
@@ -1003,9 +1002,6 @@ out:
return found;
}
-#endif /* FRONTEND */
-
-
/* ----------------------------------------
* Functions for decoding the data and block references in a record.
* ----------------------------------------
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 151c3ef..c1bd028 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
* can restart from there.
*/
break;
+ case XLOG_PARAMETER_CHANGE:
+ {
+ xl_parameter_change *xlrec =
+ (xl_parameter_change *) XLogRecGetData(buf->record);
+
+ /* Cannot proceed if master itself does not have logical data */
+ if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ break;
+ }
case XLOG_NOOP:
case XLOG_NEXTOID:
case XLOG_SWITCH:
case XLOG_BACKUP_END:
- case XLOG_PARAMETER_CHANGE:
case XLOG_RESTORE_POINT:
case XLOG_FPW_CHANGE:
case XLOG_FPI_FOR_HINT:
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index bbd38c0..9f6e0ac 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -94,6 +94,24 @@ CheckLogicalDecodingRequirements(void)
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires a database connection")));
+ if (RecoveryInProgress())
+ {
+ /*
+ * This check may have race conditions, but whenever
+ * XLOG_PARAMETER_CHANGE indicates that wal_level has changed, we
+ * verify that there are no existing logical replication slots. And to
+ * avoid races around creating a new slot,
+ * CheckLogicalDecodingRequirements() is called once before creating
+ * the slot, and once when logical decoding is initially starting up.
+ */
+ if (ControlFileWalLevel() < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ }
+
+#ifdef NOT_ANYMORE
/* ----
* TODO: We got to change that someday soon...
*
@@ -111,6 +129,7 @@ CheckLogicalDecodingRequirements(void)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("logical decoding cannot be used while in recovery")));
+#endif
}
/*
@@ -241,6 +260,8 @@ CreateInitDecodingContext(char *plugin,
LogicalDecodingContext *ctx;
MemoryContext old_context;
+ CheckLogicalDecodingRequirements();
+
/* shorter lines... */
slot = MyReplicationSlot;
@@ -474,6 +495,26 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
(uint32) (slot->data.restart_lsn >> 32),
(uint32) slot->data.restart_lsn);
+ /*
+ * It is not guaranteed that the restart_lsn points to a valid
+ * record location. E.g. on standby, restart_lsn initially points to lastReplayedEndRecPtr,
+ * which is 1 + the end of the last replayed record, which means it can point at
+ * the start of the next block header. So bump it to the next valid record.
+ */
+ if (!XRecOffIsValid(startptr))
+ {
+ elog(DEBUG1, "Invalid restart lsn %X/%X",
+ (uint32) (startptr >> 32), (uint32) startptr);
+ startptr = XLogFindNextRecord(ctx->reader, startptr);
+
+ SpinLockAcquire(&slot->mutex);
+ slot->data.restart_lsn = startptr;
+ SpinLockRelease(&slot->mutex);
+
+ elog(DEBUG1, "Moved slot restart lsn to %X/%X",
+ (uint32) (startptr >> 32), (uint32) startptr);
+ }
+
/* Wait for a consistent starting point */
for (;;)
{
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 55c306e..8c8d174 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1016,37 +1016,37 @@ ReplicationSlotReserveWal(void)
/*
* For logical slots log a standby snapshot and start logical decoding
* at exactly that position. That allows the slot to start up more
- * quickly.
+ * quickly. But on a standby we cannot do WAL writes, so just use the
+ * replay pointer; effectively, an attempt to create a logical slot on
+ * standby will cause it to wait for an xl_running_xact record so that
+ * a snapshot can be built using the record.
*
- * That's not needed (or indeed helpful) for physical slots as they'll
- * start replay at the last logged checkpoint anyway. Instead return
- * the location of the last redo LSN. While that slightly increases
- * the chance that we have to retry, it's where a base backup has to
- * start replay at.
+ * None of this is needed (or indeed helpful) for physical slots as
+ * they'll start replay at the last logged checkpoint anyway. Instead
+ * return the location of the last redo LSN. While that slightly
+ * increases the chance that we have to retry, it's where a base backup
+ * has to start replay at.
*/
+
+ restart_lsn =
+ (SlotIsPhysical(slot) ? GetRedoRecPtr() :
+ (RecoveryInProgress() ? GetXLogReplayRecPtr(NULL) :
+ GetXLogInsertRecPtr()));
+
+ SpinLockAcquire(&slot->mutex);
+ slot->data.restart_lsn = restart_lsn;
+ SpinLockRelease(&slot->mutex);
+
if (!RecoveryInProgress() && SlotIsLogical(slot))
{
XLogRecPtr flushptr;
- /* start at current insert position */
- restart_lsn = GetXLogInsertRecPtr();
- SpinLockAcquire(&slot->mutex);
- slot->data.restart_lsn = restart_lsn;
- SpinLockRelease(&slot->mutex);
-
/* make sure we have enough information to start */
flushptr = LogStandbySnapshot();
/* and make sure it's fsynced to disk */
XLogFlush(flushptr);
}
- else
- {
- restart_lsn = GetRedoRecPtr();
- SpinLockAcquire(&slot->mutex);
- slot->data.restart_lsn = restart_lsn;
- SpinLockRelease(&slot->mutex);
- }
/* prevent WAL removal as fast as possible */
ReplicationSlotsComputeRequiredLSN();
@@ -1065,6 +1065,114 @@ ReplicationSlotReserveWal(void)
}
/*
+ * Resolve recovery conflicts with slots.
+ *
+ * When xid is valid, it means it's a removed-xid kind of conflict, so we need
+ * to drop any slots whose xmin conflicts with the removed xid.
+ * When xid is invalid, drop all logical slots. This is required when the
+ * master wal_level is set back to replica, so existing logical slots need to
+ * be dropped. Also, when xid is invalid, a common 'reason' is provided for the
+ * error detail; otherwise reason is NULL.
+ */
+void
+ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid, char *reason)
+{
+ int i;
+ bool found_conflict = false;
+
+ if (max_replication_slots <= 0)
+ return;
+
+restart:
+ if (found_conflict)
+ {
+ CHECK_FOR_INTERRUPTS();
+ /*
+ * Wait awhile for them to die so that we avoid flooding an
+ * unresponsive backend when system is heavily loaded.
+ */
+ pg_usleep(100000);
+ found_conflict = false;
+ }
+
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationSlot *s;
+
+ s = &ReplicationSlotCtl->replication_slots[i];
+
+ /* cannot change while ReplicationSlotCtlLock is held */
+ if (!s->in_use)
+ continue;
+
+ /* Invalid xid means caller is asking to drop all logical slots */
+ if (!TransactionIdIsValid(xid) && SlotIsLogical(s))
+ found_conflict = true;
+ else
+ {
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;
+ StringInfoData conflict_str;
+
+ /* not our database, skip */
+ if (s->data.database != InvalidOid && s->data.database != dboid)
+ continue;
+
+ SpinLockAcquire(&s->mutex);
+ slot_xmin = s->data.xmin;
+ slot_catalog_xmin = s->data.catalog_xmin;
+ SpinLockRelease(&s->mutex);
+
+ /*
+ * Build the conflict_str which will look like :
+ * "slot xmin: 1234, catalog_xmin: 5678, removed xid : 9012"
+ */
+ initStringInfo(&conflict_str);
+ if (TransactionIdIsValid(slot_xmin) &&
+ TransactionIdPrecedesOrEquals(slot_xmin, xid))
+ appendStringInfo(&conflict_str, "slot xmin: %d", slot_xmin);
+
+ if (TransactionIdIsValid(slot_catalog_xmin) &&
+ TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
+ appendStringInfo(&conflict_str, "%sslot catalog_xmin: %d",
+ conflict_str.len > 0 ? ", " : "",
+ slot_catalog_xmin);
+
+ if (conflict_str.len > 0)
+ {
+ appendStringInfo(&conflict_str, ", %s xid : %d",
+ gettext_noop("removed"), xid);
+ found_conflict = true;
+ reason = conflict_str.data;
+ }
+ }
+
+ if (found_conflict)
+ {
+ NameData slotname;
+
+ SpinLockAcquire(&s->mutex);
+ slotname = s->data.name;
+ SpinLockRelease(&s->mutex);
+
+ ereport(LOG,
+ (errmsg("Dropping conflicting slot %s", NameStr(slotname)),
+ errdetail("%s", reason)));
+
+ LWLockRelease(ReplicationSlotControlLock); /* avoid deadlock */
+ ReplicationSlotDropPtr(s);
+
+ /* We released the lock above; so re-scan the slots. */
+ goto restart;
+ }
+ }
+
+ LWLockRelease(ReplicationSlotControlLock);
+}
+
+
+/*
* Flush all replication slots to disk.
*
* This needn't actually be part of a checkpoint, but it's a convenient
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 25b7e31..a45345c 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -23,6 +23,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/slot.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -291,7 +292,8 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
}
void
-ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode node)
+ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
+ bool onCatalogTable, RelFileNode node)
{
VirtualTransactionId *backends;
@@ -312,6 +314,9 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode
ResolveRecoveryConflictWithVirtualXIDs(backends,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+
+ if (onCatalogTable)
+ ResolveRecoveryConflictWithSlots(node.dbNode, latestRemovedXid, NULL);
}
void
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index c13c08a..bd35bc1 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -18,7 +18,9 @@
#include "access/hash.h"
#include "access/htup_details.h"
#include "access/nbtree.h"
+#include "access/table.h"
#include "bootstrap/bootstrap.h"
+#include "catalog/catalog.h"
#include "catalog/namespace.h"
#include "catalog/pg_am.h"
#include "catalog/pg_amop.h"
@@ -1893,6 +1895,20 @@ get_rel_persistence(Oid relid)
return result;
}
+bool
+get_rel_logical_catalog(Oid relid)
+{
+ bool res;
+ Relation rel;
+
+ /* assume previously locked */
+ rel = heap_open(relid, NoLock);
+ res = RelationIsAccessibleInLogicalDecoding(rel);
+ heap_close(rel, NoLock);
+
+ return res;
+}
+
/* ---------- TRANSFORM CACHE ---------- */
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 969a537..59246c3 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -48,9 +48,9 @@ typedef struct gistxlogPageUpdate
*/
typedef struct gistxlogDelete
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 ntodelete; /* number of deleted offsets */
-
/*
* In payload of blk 0 : todelete OffsetNumbers
*/
@@ -96,6 +96,7 @@ typedef struct gistxlogPageDelete
*/
typedef struct gistxlogPageReuse
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 53b682c..fd70b55 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -263,6 +263,7 @@ typedef struct xl_hash_init_bitmap_page
*/
typedef struct xl_hash_vacuum_one_page
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int ntuples;
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index f6cdca8..a1d1f11 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -237,6 +237,7 @@ typedef struct xl_heap_update
*/
typedef struct xl_heap_clean
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 nredirected;
uint16 ndead;
@@ -252,6 +253,7 @@ typedef struct xl_heap_clean
*/
typedef struct xl_heap_cleanup_info
{
+ bool onCatalogTable;
RelFileNode node;
TransactionId latestRemovedXid;
} xl_heap_cleanup_info;
@@ -332,6 +334,7 @@ typedef struct xl_heap_freeze_tuple
*/
typedef struct xl_heap_freeze_page
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint16 ntuples;
} xl_heap_freeze_page;
@@ -346,6 +349,7 @@ typedef struct xl_heap_freeze_page
*/
typedef struct xl_heap_visible
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint8 flags;
} xl_heap_visible;
@@ -395,7 +399,7 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
extern const char *heap2_identify(uint8 info);
extern void heap_xlog_logical_rewrite(XLogReaderState *r);
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
+extern XLogRecPtr log_heap_cleanup_info(Relation rel,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
@@ -414,7 +418,7 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
bool *totally_frozen);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
-extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
+extern XLogRecPtr log_heap_visible(Relation rel, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc8..f64a33c 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -126,6 +126,7 @@ typedef struct xl_btree_split
*/
typedef struct xl_btree_delete
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int nitems;
@@ -139,6 +140,7 @@ typedef struct xl_btree_delete
*/
typedef struct xl_btree_reuse_page
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/spgxlog.h b/src/include/access/spgxlog.h
index 073f740..d3dad69 100644
--- a/src/include/access/spgxlog.h
+++ b/src/include/access/spgxlog.h
@@ -237,6 +237,7 @@ typedef struct spgxlogVacuumRoot
typedef struct spgxlogVacuumRedirect
{
+ bool onCatalogTable;
uint16 nToPlaceholder; /* number of redirects to make placeholders */
OffsetNumber firstPlaceholder; /* first placeholder tuple to remove */
TransactionId newestRedirectXid; /* newest XID of removed redirects */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 237f4e0..fa02728 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -299,6 +299,7 @@ extern Size XLOGShmemSize(void);
extern void XLOGShmemInit(void);
extern void BootStrapXLOG(void);
extern void LocalProcessControlFile(bool reset);
+extern int ControlFileWalLevel(void);
extern void StartupXLOG(void);
extern void ShutdownXLOG(int code, Datum arg);
extern void InitXLOGAccess(void);
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 04228e2..a5ffffc 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -215,9 +215,7 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
/* Invalidate read state */
extern void XLogReaderInvalReadState(XLogReaderState *state);
-#ifdef FRONTEND
extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
-#endif /* FRONTEND */
/* Functions for decoding an XLogRecord */
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 8fbddea..3a90aac 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -205,4 +205,6 @@ extern void CheckPointReplicationSlots(void);
extern void CheckSlotRequirements(void);
+extern void ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid, char *reason);
+
#endif /* SLOT_H */
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index a3f8f82..6dedebc 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -28,7 +28,7 @@ extern void InitRecoveryTransactionEnvironment(void);
extern void ShutdownRecoveryTransactionEnvironment(void);
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
- RelFileNode node);
+ bool onCatalogTable, RelFileNode node);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index c8df5bf..579d9ff 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -131,6 +131,7 @@ extern char get_rel_relkind(Oid relid);
extern bool get_rel_relispartition(Oid relid);
extern Oid get_rel_tablespace(Oid relid);
extern char get_rel_persistence(Oid relid);
+extern bool get_rel_logical_catalog(Oid relid);
extern Oid get_transform_fromsql(Oid typid, Oid langid, List *trftypes);
extern Oid get_transform_tosql(Oid typid, Oid langid, List *trftypes);
extern bool get_typisdefined(Oid typid);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d7f33ab..8c90fd7 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -16,6 +16,7 @@
#include "access/tupdesc.h"
#include "access/xlog.h"
+#include "catalog/catalog.h"
#include "catalog/pg_class.h"
#include "catalog/pg_index.h"
#include "catalog/pg_publication.h"
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 8d5ad6b..a9a1ac7 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -2009,6 +2009,33 @@ sub pg_recvlogical_upto
=pod
+=item $node->create_logical_slot_on_standby(self, master, slot_name, dbname)
+
+Create logical replication slot on given standby
+
+=cut
+
+sub create_logical_slot_on_standby
+{
+ my ($self, $master, $slot_name, $dbname) = @_;
+ my ($stdout, $stderr);
+
+ my $handle;
+
+ $handle = IPC::Run::start(['pg_recvlogical', '-d', $self->connstr($dbname), '-P', 'test_decoding', '-S', $slot_name, '--create-slot'], '>', \$stdout, '2>', \$stderr);
+ sleep(1);
+
+ # Slot creation on standby waits for an xl_running_xacts record. So arrange
+ # for it.
+ $master->safe_psql('postgres', 'CHECKPOINT');
+
+ $handle->finish();
+
+ return 0;
+}
+
+=pod
+
=back
=cut
diff --git a/src/test/recovery/t/018_logical_decoding_on_replica.pl b/src/test/recovery/t/018_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..304f32a
--- /dev/null
+++ b/src/test/recovery/t/018_logical_decoding_on_replica.pl
@@ -0,0 +1,395 @@
+# Test logical decoding on a standby.
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 51;
+use RecursiveCopy;
+use File::Copy;
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--slot=decoding_standby');
+
+# Fetch xmin columns from slot's pg_replication_slots row, after waiting for
+# given boolean condition to be true to ensure we've reached a quiescent state
+sub wait_for_phys_mins
+{
+ my ($node, $slotname, $check_expr) = @_;
+
+ $node->poll_query_until(
+ 'postgres', qq[
+ SELECT $check_expr
+ FROM pg_catalog.pg_replication_slots
+ WHERE slot_name = '$slotname';
+ ]) or die "Timed out waiting for slot xmins to advance";
+
+ my $slotinfo = $node->slot($slotname);
+ return ($slotinfo->{'xmin'}, $slotinfo->{'catalog_xmin'});
+}
+
+sub print_phys_xmin
+{
+ my $slot = $node_master->slot('decoding_standby');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# After slot creation, xmins must be null
+is($xmin, '', "xmin null");
+is($catalog_xmin, '', "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+ $node_master, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_replica->append_conf('postgresql.conf',
+ q[primary_slot_name = 'decoding_standby']);
+
+$node_replica->start;
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# with hot_standby_feedback off, xmin and catalog_xmin must still be null
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "xmin null after replica join");
+is($catalog_xmin, '', "catalog_xmin null after replica join");
+
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. With hot_standby_feedback on, xmin should advance,
+# but catalog_xmin should still remain NULL since there is no logical slot.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NULL");
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+#
+# This must succeed since we know we have a catalog_xmin reservation. We
+# might've already sent hot standby feedback to advance our physical slot's
+# catalog_xmin but not received the corresponding xlog for the catalog xmin
+# advance, in which case we'll create a slot that isn't usable. The calling
+# application can prevent this by creating a temporary slot on the master to
+# lock in its catalog_xmin. For a truly race-free solution we'd need
+# master-to-standby hot_standby_feedback replies.
+#
+# In this case it won't race because there's no concurrent activity on the
+# master.
+#
+is($node_replica->create_logical_slot_on_standby($node_master, 'standby_logical', 'testdb'),
+ 0, 'logical slot creation on standby succeeded')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+sub print_logical_xmin
+{
+ my $slot = $node_replica->slot('standby_logical');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
+$node_master->safe_psql('testdb', 'DROP TABLE test_table');
+$node_master->safe_psql('testdb', 'VACUUM');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+# Should show the inserts even when the table is dropped on master
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($stderr, '', 'stderr is empty');
+is($ret, 0, 'replay from slot succeeded')
+ or BAIL_OUT('cannot continue if slot replay fails');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_1();]);
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+for my $i (0 .. 2000)
+{
+ $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_2();]);
+$node_master->safe_psql('testdb', 'VACUUM');
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+cmp_ok($new_logical_catalog_xmin, "==", $logical_catalog_xmin, "logical slot catalog_xmin hasn't advanced before get_changes");
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+isnt($stdout, '', 'replayed some rows');
+
+($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+is($new_logical_xmin, '', "logical xmin null");
+isnt($new_logical_catalog_xmin, '', "logical slot catalog_xmin not null");
+cmp_ok($new_logical_catalog_xmin, ">", $logical_catalog_xmin, "logical slot catalog_xmin advanced after get_changes");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+isnt($new_physical_xmin, '', "physical xmin not null");
+# hot standby feedback should advance phys catalog_xmin now that the standby's
+# slot doesn't hold it down as far.
+isnt($new_physical_catalog_xmin, '', "physical catalog_xmin not null");
+cmp_ok($new_physical_catalog_xmin, ">", $physical_catalog_xmin, "physical catalog_xmin advanced");
+
+cmp_ok($new_physical_catalog_xmin, "<=", $new_logical_catalog_xmin, 'upstream physical slot catalog_xmin not past downstream catalog_xmin with hs_feedback on');
+
+#########################################################
+# Upstream oldestXid retention
+#########################################################
+
+sub test_oldest_xid_retention()
+{
+ # First burn some xids on the master in another DB, so we push the master's
+ # nextXid ahead.
+ foreach my $i (1 .. 100)
+ {
+ $node_master->safe_psql('postgres', 'SELECT txid_current()');
+ }
+
+ # Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
+ # past our needed xmin. The only way we have visibility into that is to force
+ # a checkpoint.
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
+ foreach my $dbname ('template1', 'postgres', 'testdb', 'template0')
+ {
+ $node_master->safe_psql($dbname, 'VACUUM FREEZE');
+ }
+ sleep(1);
+ $node_master->safe_psql('postgres', 'CHECKPOINT');
+ IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
+ or die "pg_controldata failed with $?";
+ my @checkpoint = split('\n', $stdout);
+ my ($oldestXid, $nextXid) = ('', '');
+ foreach my $line (@checkpoint)
+ {
+ if ($line =~ qr/^Latest checkpoint's NextXID:\s+\d+:(\d+)/)
+ {
+ $nextXid = $1;
+ }
+ if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
+ {
+ $oldestXid = $1;
+ }
+ }
+ die 'no oldestXID found in checkpoint' unless $oldestXid;
+
+ my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+ my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+
+ print "upstream oldestXid $oldestXid, nextXid $nextXid, phys slot catalog_xmin $new_physical_catalog_xmin, downstream catalog_xmin $new_logical_catalog_xmin";
+
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
+
+ return ($oldestXid);
+}
+
+my ($oldestXid) = test_oldest_xid_retention();
+
+cmp_ok($oldestXid, "<=", $new_logical_catalog_xmin, 'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
+
+########################################################################
+# Recovery conflict: conflicting replication slot should get dropped
+########################################################################
+
+# One way to reproduce a recovery conflict is to run VACUUM FULL on the master
+# while hot_standby_feedback is turned off on the standby.
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = off
+]);
+$node_replica->restart;
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. Both should be NULL since hs_feedback is off
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NULL AND catalog_xmin IS NULL");
+$node_master->safe_psql('testdb', 'VACUUM FULL');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+isnt($ret, 0, 'usage of slot failed as expected');
+like($stderr, qr/does not exist/, 'slot not found as expected');
+
+# Re-create the slot now that we know it is dropped
+is($node_replica->create_logical_slot_on_standby($node_master, 'standby_logical', 'testdb'),
+ 0, 'logical slot creation on standby succeeded')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+# Set hot_standby_feedback back on
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. Both should be non-NULL since hs_feedback is on and
+# there is a logical slot present on standby.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NOT NULL");
+
+##################################################
+# Drop slot
+##################################################
+#
+is($node_replica->safe_psql('postgres', 'SHOW hot_standby_feedback'), 'on', 'hs_feedback is on');
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+
+# Make sure slots on replicas are droppable, and properly clear the upstream's xmin
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
+
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. catalog_xmin should become NULL because we dropped
+# the logical slot.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NULL");
+
+##################################################
+# Recovery: drop database drops idle slots
+##################################################
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB on the upstream if they're on the right DB, or not dropped if on
+# another DB.
+
+is($node_replica->create_logical_slot_on_standby($node_master, 'dodropslot', 'testdb'),
+ 0, 'created dodropslot on testdb')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+is($node_replica->create_logical_slot_on_standby($node_master, 'otherslot', 'postgres'),
+ 0, 'created otherslot on postgres')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, 'logical', 'slot dodropslot on standby created');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'slot otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+ 'database dropped on standby');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, '', 'slot on standby dropped');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby not dropped');
+
+
+##################################################
+# Recovery: drop database drops in-use slots
+##################################################
+
+# This time, have the slot in-use on the downstream DB when we drop it.
+print "Testing dropdb when downstream slot is in-use";
+$node_master->psql('postgres', q[CREATE DATABASE testdb2]);
+
+print "creating slot dodropslot2";
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-P', 'test_decoding', '-S', 'dodropslot2', '--create-slot'],
+ 'pg_recvlogical created slot dodropslot2');
+is($node_replica->slot('dodropslot2')->{'slot_type'}, 'logical', 'slot dodropslot2 on standby created');
+
+# make sure the slot is in use
+print "starting pg_recvlogical";
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+sleep(1);
+
+is($node_replica->slot('dodropslot2')->{'active'}, 't', 'slot on standby is active')
+ or BAIL_OUT("slot not active on standby, cannot continue. pg_recvlogical exited with '$stdout', '$stderr'");
+
+# Master doesn't know the replica's slot is busy so dropdb should succeed
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb2]);
+ok(1, 'dropdb finished');
+
+while ($node_replica->slot('dodropslot2')->{'active_pid'})
+{
+ sleep(1);
+ print "waiting for walsender to exit";
+}
+
+print "walsender exited, waiting for pg_recvlogical to exit";
+
+# our client should've terminated in response to the walsender error
+eval {
+ $handle->finish;
+};
+$return = $?;
+if ($return) {
+ is($return, 256, "pg_recvlogical terminated by server");
+ like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict');
+ like($stderr, qr/User was connected to a database that must be dropped./, 'recvlogical recovery conflict db');
+}
+
+is($node_replica->slot('dodropslot2')->{'active_pid'}, '', 'walsender backend exited');
+
+# The slot should be dropped by recovery now
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb2')]), 'f',
+ 'database dropped on standby');
+
+is($node_replica->slot('dodropslot2')->{'slot_type'}, '', 'slot on standby dropped');
--
2.1.4
On Wed, 12 Jun 2019 at 00:06, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2019-May-23, Andres Freund wrote:
On 2019-05-23 09:37:50 -0400, Robert Haas wrote:
On Thu, May 23, 2019 at 9:30 AM Sergei Kornilov <sk@zsrv.org> wrote:
wal_level is PGC_POSTMASTER.
But the primary can be restarted without a restart of the standby. We require wal_level replica or higher (currently only logical) on the standby. So an online change from logical to replica wal_level is possible in the standby's control file.
That's true, but Amit's scenario involved a change in wal_level during
the execution of pg_create_logical_replication_slot(), which I think
can't happen.
I don't see why not - we're talking about the wal_level in the WAL
stream, not the setting on the standby. And that can change during the
execution of pg_create_logical_replication_slot(), if a PARAMETER_CHANGE
record is replayed. I don't think it's actually a problem, as I
outlined in my response to Amit, though.
I don't know if this is directly relevant, but in commit_ts.c we go to
great lengths to ensure that things continue to work across restarts and
changes of the GUC in the primary, by decoupling activation and
deactivation of the module from start-time initialization. Maybe that
idea is applicable for this too?
We do kind of handle change in wal_level differently at run-time
versus at initialization. E.g. we drop the existing slots if the
wal_level becomes less than logical. But I think we don't have to do
significant work, unlike what seems to have been done in
ActivateCommitTs() when commit_ts is activated.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Hi,
On 2019-06-12 17:30:02 +0530, Amit Khandekar wrote:
On Tue, 11 Jun 2019 at 12:24, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Mon, 10 Jun 2019 at 10:37, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
Since this requires the test to handle the
fire-create-slot-and-then-fire-checkpoint-from-master actions, I was
modifying the test file to do this. After doing that, I found that the
slave gets an assertion failure in XLogReadRecord()=>XRecOffIsValid().
This happens only when the restart_lsn is set to ReplayRecPtr.
Somehow, this does not happen when I manually create the logical slot.
It happens only while running the testcase. Working on it ...
Like I mentioned above, I get an assertion failure for
Assert(XRecOffIsValid(RecPtr)) while reading WAL records looking for a
start position (DecodingContextFindStartpoint()). This is because in
CreateInitDecodingContext()=>ReplicationSlotReserveWal(), I now set
the logical slot's restart_lsn to XLogCtl->lastReplayedEndRecPtr. And
just after bringing up slave, lastReplayedEndRecPtr's initial values
are in this order : 0/2000028, 0/2000060, 0/20000D8, 0/2000100,
0/3000000, 0/3000060. You can see that 0/3000000 is not a valid value
because it points to the start of a WAL block, meaning it points to
the XLog page header (I think it's possible because it is 1 + end of
the last replayed record, which can be the start of the next block). So when we
try to create a slot when it's in that position, then XRecOffIsValid()
fails while looking for a starting point.
One option I considered was: if lastReplayedEndRecPtr points to the XLog
page header, get a position of the first record on that WAL block,
probably with XLogFindNextRecord(). But it is not trivial because
while in ReplicationSlotReserveWal(), XLogReaderState is not created
yet.
In the attached v6 version of the patch, I did the above. That is, I
used XLogFindNextRecord() to bump up the restart_lsn of the slot to
the first valid record. But since XLogReaderState is not available in
ReplicationSlotReserveWal(), I did this in
DecodingContextFindStartpoint(). And then updated the slot restart_lsn
with this corrected position.
Since XLogFindNextRecord() is currently disabled using #if 0, removed
this directive.
Well, ifdef FRONTEND. I don't think that's a problem. It's a bit
overkill here, because I think we know the address has to be on a record
boundary (rather than being in the middle of a page spanning WAL
record). So we could just add the size of the header manually - but
I think that's not worth doing.
Or else, do you think we can just increment the record pointer by
doing something like (lastReplayedEndRecPtr % XLOG_BLCKSZ) +
SizeOfXLogShortPHD() ?
I found out that we can't do this, because we don't know whether the
xlog header is SizeOfXLogShortPHD or SizeOfXLogLongPHD. In fact, in
our context, it is SizeOfXLogLongPHD. So we indeed need the
XLogReaderState handle.
Well, we can determine whether a long or a short header is going to be
used, as that's solely dependent on the LSN:
/*
* If first page of an XLOG segment file, make it a long header.
*/
if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
{
XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
NewLongPage->xlp_sysid = ControlFile->system_identifier;
NewLongPage->xlp_seg_size = wal_segment_size;
NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
NewPage->xlp_info |= XLP_LONG_HEADER;
}
but I don't think that's worth it.
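(For illustration only, a minimal sketch of the idea described and then declined above; the helper name is made up, while XLogSegmentOffset, wal_segment_size and the SizeOfXLog*PHD constants are the existing ones. It assumes the LSN in question sits exactly on a page boundary, which is the case being discussed:)

    /* Hypothetical helper: page header size implied purely by the LSN. */
    static Size
    page_header_size_at(XLogRecPtr lsn)
    {
        /* the first page of a segment carries the long header */
        if (XLogSegmentOffset(lsn, wal_segment_size) == 0)
            return SizeOfXLogLongPHD;
        return SizeOfXLogShortPHD;
    }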
Do you think that we can solve this using some other approach? I am
not sure whether it's only the initial conditions that cause the
lastReplayedEndRecPtr value to *not* point to a valid record, or whether it is
just a coincidence and lastReplayedEndRecPtr can also have such a
value any time afterwards.
It's always possible. All that means is that the last record filled the
entire last WAL page.
If it's only possible initially, we can
just use GetRedoRecPtr() instead of lastReplayedEndRecPtr if
lastReplayedEndRecPtr is invalid.
I don't think so? The redo pointer will point to something *much*
earlier, where we'll not yet have done all the necessary conflict
handling during recovery? So we'd not necessarily notice that a slot
is not actually usable for decoding.
We could instead just handle that by starting decoding at the redo
pointer, and just ignore all WAL records until they're after
lastReplayedEndRecPtr, but that has no advantages, and will read a lot
more WAL.
static void _bt_cachemetadata(Relation rel, BTMetaPageData *input);
@@ -773,6 +774,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
	 */

	/* XLOG stuff */
+	xlrec_reuse.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
	xlrec_reuse.node = rel->rd_node;
	xlrec_reuse.block = blkno;
	xlrec_reuse.latestRemovedXid = latestRemovedXid;
@@ -1140,6 +1142,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
	XLogRecPtr	recptr;
	xl_btree_delete xlrec_delete;

+	xlrec_delete.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_delete.latestRemovedXid = latestRemovedXid;
xlrec_delete.nitems = nitems;
Can we instead pass the heap rel down to here? I think there's only one
caller, and it has the heap relation available these days (it didn't at
the time of the prototype, possibly). There's a few other users of
get_rel_logical_catalog() where that might be harder, but it's easy
here.
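(A hedged sketch of what that could look like at this call site, assuming the heap relation is indeed available as a parameter - called heapRel here - so the catalog lookup goes away:)

    xl_btree_delete xlrec_delete;

    /* use the heap relation handed down by the caller instead of a lookup */
    xlrec_delete.onCatalogTable = RelationIsAccessibleInLogicalDecoding(heapRel);
    xlrec_delete.latestRemovedXid = latestRemovedXid;
    xlrec_delete.nitems = nitems;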
@@ -27,6 +27,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
+#include "utils/lsyscache.h"/* Entry in pending-list of TIDs we need to revisit */
@@ -502,6 +503,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
OffsetNumber itemnos[MaxIndexTuplesPerPage];
	spgxlogVacuumRedirect xlrec;

+	xlrec.onCatalogTable = get_rel_logical_catalog(index->rd_index->indrelid);
xlrec.nToPlaceholder = 0;
xlrec.newestRedirectXid = InvalidTransactionId;
This one seems harder, but I'm not actually sure why we make it so
hard. It seems like we just ought to add the table to IndexVacuumInfo.
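(Rough sketch of that direction; the field name heaprel is made up here, and the comment stands in for whatever fields IndexVacuumInfo already carries:)

    typedef struct IndexVacuumInfo
    {
        Relation    index;      /* the index being vacuumed */
        Relation    heaprel;    /* hypothetical addition: owning heap relation */
        /* ... existing fields (analyze_only, message_level, ...) unchanged ... */
    } IndexVacuumInfo;

    /* with info threaded down, vacuumRedirectAndPlaceholder() could then do: */
    xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(info->heaprel);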
 /*
+ * Get the wal_level from the control file.
+ */
+int
+ControlFileWalLevel(void)
+{
+	return ControlFile->wal_level;
+}
Any reason not to return the type enum WalLevel instead? I'm not sure I
like the function name - perhaps something like GetActiveWalLevel() or
such? The fact that it's in the control file doesn't seem relevant
here. I think it should be close to DataChecksumsEnabled() etc, which
all return information from the control file.
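(Sketch of the suggested shape, returning the enum and named after what it reports rather than where it is stored:)

    /*
     * Return the wal_level currently recorded in pg_control.
     */
    WalLevel
    GetActiveWalLevel(void)
    {
        return (WalLevel) ControlFile->wal_level;
    }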
+/*
* Initialization of shared memory for XLOG
*/
Size
@@ -9843,6 +9852,17 @@ xlog_redo(XLogReaderState *record)
/* Update our copy of the parameters in pg_control */
	memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
+	/*
+	 * Drop logical slots if we are in hot standby and master does not have
+	 * logical data. Don't bother to search for the slots if standby is
+	 * running with wal_level lower than logical, because in that case,
+	 * we would have disallowed creation of logical slots.
+	 */
s/disallowed creation/disallowed creation or previously dropped/
+	if (InRecovery && InHotStandby &&
+		xlrec.wal_level < WAL_LEVEL_LOGICAL &&
+		wal_level >= WAL_LEVEL_LOGICAL)
+		ResolveRecoveryConflictWithSlots(InvalidOid, InvalidTransactionId);
+
	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
	ControlFile->MaxConnections = xlrec.MaxConnections;
	ControlFile->max_worker_processes = xlrec.max_worker_processes;
Not for this patch, but I kinda feel the individual replay routines
ought to be broken out of xlog_redo().
 /* ----------------------------------------
  * Functions for decoding the data and block references in a record.
  * ----------------------------------------
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 151c3ef..c1bd028 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
			 * can restart from there.
			 */
			break;
+		case XLOG_PARAMETER_CHANGE:
+		{
+			xl_parameter_change *xlrec =
+				(xl_parameter_change *) XLogRecGetData(buf->record);
+
+			/* Cannot proceed if master itself does not have logical data */
+			if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("logical decoding on standby requires "
+								"wal_level >= logical on master")));
+			break;
+		}
This should also HINT to drop the replication slot.
+	/*
+	 * It is not guaranteed that the restart_lsn points to a valid
+	 * record location. E.g. on standby, restart_lsn initially points to lastReplayedEndRecPtr,
+	 * which is 1 + the end of last replayed record, which means it can point the next
+	 * block header start. So bump it to the next valid record.
+	 */
I'd rephrase this as something like:
restart_lsn initially may point one past the end of the record. If that
is a XLOG page boundary, it will not be a valid LSN for the start of a
record. If that's the case, look for the start of the first record.
+ if (!XRecOffIsValid(startptr))
+ {
Hm, could you before this add an Assert(startptr != InvalidXLogRecPtr)
or such?
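(I.e., something like the following just before the check, treating an unset restart_lsn as a programming error rather than something to silently fix up:)

    /* the slot must already have reserved WAL; only its alignment is in doubt */
    Assert(startptr != InvalidXLogRecPtr);
    if (!XRecOffIsValid(startptr))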
+ elog(DEBUG1, "Invalid restart lsn %X/%X", + (uint32) (startptr >> 32), (uint32) startptr); + startptr = XLogFindNextRecord(ctx->reader, startptr); + + SpinLockAcquire(&slot->mutex); + slot->data.restart_lsn = startptr; + SpinLockRelease(&slot->mutex); + elog(DEBUG1, "Moved slot restart lsn to %X/%X", + (uint32) (startptr >> 32), (uint32) startptr); + }
Minor nit: normally debug messages don't start with upper case.
	/* Wait for a consistent starting point */
	for (;;)
	{
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 55c306e..7ffd264 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1016,37 +1016,37 @@ ReplicationSlotReserveWal(void)
	/*
	 * For logical slots log a standby snapshot and start logical decoding
	 * at exactly that position. That allows the slot to start up more
-	 * quickly.
+	 * quickly. But on a standby we cannot do WAL writes, so just use the
+	 * replay pointer; effectively, an attempt to create a logical slot on
+	 * standby will cause it to wait for an xl_running_xact record so that
+	 * a snapshot can be built using the record.
I'd add "to be logged independently on the primary" after "wait for an
xl_running_xact record".
-	 * That's not needed (or indeed helpful) for physical slots as they'll
-	 * start replay at the last logged checkpoint anyway. Instead return
-	 * the location of the last redo LSN. While that slightly increases
-	 * the chance that we have to retry, it's where a base backup has to
-	 * start replay at.
+	 * None of this is needed (or indeed helpful) for physical slots as
+	 * they'll start replay at the last logged checkpoint anyway. Instead
+	 * return the location of the last redo LSN. While that slightly
+	 * increases the chance that we have to retry, it's where a base backup
+	 * has to start replay at.
	 */
+
+	restart_lsn =
+		(SlotIsPhysical(slot) ? GetRedoRecPtr() :
+		 (RecoveryInProgress() ? GetXLogReplayRecPtr(NULL) :
+		  GetXLogInsertRecPtr()));
Please rewrite this to use normal if blocks. I'm also not convinced that
it's useful to have this if block, and then another if block that
basically tests the same conditions again.
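(A sketch of the flattened version, keeping the same three cases from the quoted code but as ordinary if/else blocks and a single spinlocked assignment:)

    if (SlotIsPhysical(slot))
        restart_lsn = GetRedoRecPtr();
    else if (RecoveryInProgress())
        restart_lsn = GetXLogReplayRecPtr(NULL);
    else
        restart_lsn = GetXLogInsertRecPtr();

    SpinLockAcquire(&slot->mutex);
    slot->data.restart_lsn = restart_lsn;
    SpinLockRelease(&slot->mutex);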
+	SpinLockAcquire(&slot->mutex);
+	slot->data.restart_lsn = restart_lsn;
+	SpinLockRelease(&slot->mutex);
+
	if (!RecoveryInProgress() && SlotIsLogical(slot))
	{
		XLogRecPtr	flushptr;

-		/* start at current insert position */
- restart_lsn = GetXLogInsertRecPtr();
- SpinLockAcquire(&slot->mutex);
- slot->data.restart_lsn = restart_lsn;
- SpinLockRelease(&slot->mutex);
-
/* make sure we have enough information to start */
		flushptr = LogStandbySnapshot();

		/* and make sure it's fsynced to disk */
XLogFlush(flushptr);
}
- else
- {
- restart_lsn = GetRedoRecPtr();
- SpinLockAcquire(&slot->mutex);
- slot->data.restart_lsn = restart_lsn;
- SpinLockRelease(&slot->mutex);
- }
+/*
+ * Resolve recovery conflicts with slots.
+ *
+ * When xid is valid, it means it's a removed-xid kind of conflict, so need to
+ * drop the appropriate slots whose xmin conflicts with removed xid.
I don't think "removed-xid kind of conflict" is that descriptive. I'd
suggest something like "When xid is valid, it means that rows older than
xid might have been removed. Therefore we need to drop slots that depend
on seeing those rows."
+ * When xid is invalid, drop all logical slots. This is required when the
+ * master wal_level is set back to replica, so existing logical slots need to
+ * be dropped.
+ */
+void
+ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
+{
+	int			i;
+	bool		found_conflict = false;
+
+	if (max_replication_slots <= 0)
+		return;
+
+restart:
+	if (found_conflict)
+	{
+		CHECK_FOR_INTERRUPTS();
+		/*
+		 * Wait awhile for them to die so that we avoid flooding an
+		 * unresponsive backend when system is heavily loaded.
+		 */
+		pg_usleep(100000);
+		found_conflict = false;
+	}
Hm, I wonder if we could use the condition variable the slot
infrastructure has these days for this instead.
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	for (i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s;
+		NameData	slotname;
+		TransactionId slot_xmin;
+		TransactionId slot_catalog_xmin;
+
+		s = &ReplicationSlotCtl->replication_slots[i];
+
+		/* cannot change while ReplicationSlotCtlLock is held */
+		if (!s->in_use)
+			continue;
+
+		/* Invalid xid means caller is asking to drop all logical slots */
+		if (!TransactionIdIsValid(xid) && SlotIsLogical(s))
+			found_conflict = true;
I'd just add
if (!SlotIsLogical(s))
continue;
because all of this doesn't need to happen for slots that aren't
logical.
+		else
+		{
+			/* not our database, skip */
+			if (s->data.database != InvalidOid && s->data.database != dboid)
+				continue;
+
+			SpinLockAcquire(&s->mutex);
+			slotname = s->data.name;
+			slot_xmin = s->data.xmin;
+			slot_catalog_xmin = s->data.catalog_xmin;
+			SpinLockRelease(&s->mutex);
+
+			if (TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid))
+			{
+				found_conflict = true;
+
+				ereport(LOG,
+						(errmsg("slot %s w/ xmin %u conflicts with removed xid %u",
+								NameStr(slotname), slot_xmin, xid)));
+			}
s/removed xid/xid horizon being increased to %u/
+			if (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
+			{
+				found_conflict = true;
+
+				ereport(LOG,
+						(errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
+								NameStr(slotname), slot_catalog_xmin, xid)));
+			}
+		}
+
+		if (found_conflict)
+		{
Hm, as far as I can tell you just ignore that the slot might currently
be in use. You can't just drop a slot that somebody is using. I think
you need to send a recovery conflict to that backend.
I guess the easiest way to do that would be something roughly like:
SetInvalidVirtualTransactionId(vxid);
LWLockAcquire(ProcArrayLock, LW_SHARED);
cancel_proc = BackendPidGetProcWithLock(active_pid);
if (cancel_proc)
vxid = GET_VXID_FROM_PGPROC(cancel_proc);
LWLockRelease(ProcArrayLock);
if (VirtualTransactionIdIsValid(vxid))
{
CancelVirtualTransaction(vxid);
/* Wait here until we get signaled, and then restart */
ConditionVariableSleep(&slot->active_cv,
WAIT_EVENT_REPLICATION_SLOT_DROP);
}
ConditionVariableCancelSleep();
when the slot is currently active. Part of this would need to be split
into a procarray.c helper function (mainly all the stuff dealing with
ProcArrayLock).
+ elog(LOG, "Dropping conflicting slot %s", s->data.name.data);
This definitely needs to be expanded, and follow the message style
guideline.
+ LWLockRelease(ReplicationSlotControlLock); /* avoid deadlock */
Instead of saying "deadlock" I'd just say that ReplicationSlotDropPtr()
will acquire that lock.
+ ReplicationSlotDropPtr(s);
But more importantly, I don't think this is
correct. ReplicationSlotDropPtr() assumes that the to-be-dropped slot is
acquired by the current backend - without that somebody else could
concurrently acquire that slot.
SO I think you need to do something like ReplicationSlotsDropDBSlots()
does:
/* acquire slot, so ReplicationSlotDropAcquired can be reused */
SpinLockAcquire(&s->mutex);
/* can't change while ReplicationSlotControlLock is held */
slotname = NameStr(s->data.name);
active_pid = s->active_pid;
if (active_pid == 0)
{
MyReplicationSlot = s;
s->active_pid = MyProcPid;
}
SpinLockRelease(&s->mutex);
Greetings,
Andres Freund
I am yet to work on Andres's latest detailed review comments, but
before that I thought I should submit a patch for the below reported
issue, because I was almost ready with the fix. Now I will start to
work on Andres's comments, for which I will reply separately.
On Fri, 1 Mar 2019 at 13:33, tushar <tushar.ahuja@enterprisedb.com> wrote:
Hi,
While testing this feature, I found that if lots of inserts happen on
the master cluster, then pg_recvlogical does not show the DATA
information on the logical replication slot which was created on the SLAVE.
Please refer to this scenario -
1)
Create a Master cluster with wal_level=logical and create a logical
replication slot -
SELECT * FROM pg_create_logical_replication_slot('master_slot',
'test_decoding');
2)
Create a Standby cluster using pg_basebackup ( ./pg_basebackup -D
slave/ -v -R) and create logical replication slot -
SELECT * FROM pg_create_logical_replication_slot('standby_slot',
'test_decoding');
3)
X terminal - start pg_recvlogical , provide port=5555 ( slave
cluster) and specify slot=standby_slot
./pg_recvlogical -d postgres -p 5555 -s 1 -F 1 -v --slot=standby_slot
--start -f -
Y terminal - start pg_recvlogical , provide port=5432 ( master
cluster) and specify slot=master_slot
./pg_recvlogical -d postgres -p 5432 -s 1 -F 1 -v --slot=master_slot
--start -f -
Z terminal - run pg_bench against Master cluster ( ./pg_bench -i -s 10
postgres)
Able to see DATA information on Y terminal but not on X,
but the same can be seen by firing this below query on the SLAVE cluster -
SELECT * FROM pg_logical_slot_get_changes('standby_slot', NULL, NULL);
Is it expected ?
Actually, it does show the records, but after quite a long time. In general,
the walsender on the standby sends each record only after a significant delay (1
sec), and pg_recvlogical shows all the inserted records only after the
commit, so for huge inserts it looks like it is hanging forever.
In XLogSendLogical(), GetFlushRecPtr() was used to get the flushed
point. On standby, GetFlushRecPtr() does not give a valid value, so it
was wrongly determined that the sent record is beyond flush point, as
a result of which, WalSndCaughtUp was set to true, causing
WalSndLoop() to sleep for some duration after every record. This is
why pg_recvlogical appears to be hanging forever when a huge
number of rows is inserted.
Fix : Use GetStandbyFlushRecPtr() if am_cascading_walsender.
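(Roughly, the change amounts to the following in XLogSendLogical(); this is a hedged sketch, and the surrounding walsender.c code may differ in detail:)

    XLogRecPtr  flushPtr;

    /*
     * On a standby the primary-side flush pointer isn't meaningful, so use
     * the standby's own flush/replay position to decide whether we have
     * caught up.
     */
    if (am_cascading_walsender)
        flushPtr = GetStandbyFlushRecPtr();
    else
        flushPtr = GetFlushRecPtr();

    if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
        WalSndCaughtUp = true;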
Attached patch v8.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
logical-decoding-on-standby_v8.patch
From bc92aff893a63eb04912b93e980b18984b939135 Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Thu, 20 Jun 2019 15:17:26 +0530
Subject: [PATCH] Logical decoding on standby - v8.
Author : Andres Freund.
Besides the above main changes, the patch includes the following:
1. Handle slot conflict recovery by dropping the conflicting slots.
-Amit Khandekar.
2. test/recovery/t/016_logical_decoding_on_replica.pl added.
Original author : Craig Ringer. A few changes/additions from Amit Khandekar.
3. Handle slot conflicts when master wal_level becomes less than logical.
Changes in v6 patch :
While creating the slot, lastReplayedEndRecPtr is used to set the
restart_lsn, but its position is later adjusted in
DecodingContextFindStartpoint() in case it does not point to a
valid record location. This can happen because the replay pointer
points to 1 + the end of the last replayed record, which means it can
coincide with the first byte of a new WAL block, i.e. a position inside
the block header.
Also, modified the test to handle the requirement that the
logical slot creation on standby requires a checkpoint
(or any other transaction commit) to be given from master. For
that, in src/test/perl/PostgresNode.pm, added a new function
create_logical_slot_on_standby() which does the required steps.
Changes in v7 patch :
Merge the two conflict messages for xmin and catalog_xmin into
a single one.
Changes in v8 :
Fix incorrect flush ptr on standby.
In XLogSendLogical(), GetFlushRecPtr() was used to get the flushed
point. On standby, GetFlushRecPtr() does not give a valid value, so it
was wrongly determined that the sent record is beyond flush point, as
a result of which, WalSndCaughtUp was set to true, causing
WalSndLoop() to sleep for some duration after every record.
This was reported by Tushar Ahuja, where pg_recvlogical seemed to
hang when there are loads of inserts.
Fix: Use GetStandbyFlushRecPtr() if am_cascading_walsender.
---
src/backend/access/gist/gistxlog.c | 6 +-
src/backend/access/hash/hash_xlog.c | 3 +-
src/backend/access/hash/hashinsert.c | 2 +
src/backend/access/heap/heapam.c | 23 +-
src/backend/access/heap/vacuumlazy.c | 2 +-
src/backend/access/heap/visibilitymap.c | 2 +-
src/backend/access/nbtree/nbtpage.c | 3 +
src/backend/access/nbtree/nbtxlog.c | 4 +-
src/backend/access/spgist/spgvacuum.c | 2 +
src/backend/access/spgist/spgxlog.c | 1 +
src/backend/access/transam/xlog.c | 21 ++
src/backend/access/transam/xlogreader.c | 4 -
src/backend/replication/logical/decode.c | 14 +-
src/backend/replication/logical/logical.c | 41 +++
src/backend/replication/slot.c | 146 +++++++-
src/backend/replication/walsender.c | 8 +-
src/backend/storage/ipc/standby.c | 7 +-
src/backend/utils/cache/lsyscache.c | 16 +
src/include/access/gistxlog.h | 3 +-
src/include/access/hash_xlog.h | 1 +
src/include/access/heapam_xlog.h | 8 +-
src/include/access/nbtxlog.h | 2 +
src/include/access/spgxlog.h | 1 +
src/include/access/xlog.h | 1 +
src/include/access/xlogreader.h | 2 -
src/include/replication/slot.h | 2 +
src/include/storage/standby.h | 2 +-
src/include/utils/lsyscache.h | 1 +
src/include/utils/rel.h | 1 +
src/test/perl/PostgresNode.pm | 27 ++
.../recovery/t/018_logical_decoding_on_replica.pl | 395 +++++++++++++++++++++
31 files changed, 704 insertions(+), 47 deletions(-)
create mode 100644 src/test/recovery/t/018_logical_decoding_on_replica.pl
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 503db34..385ea1f 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -195,7 +195,8 @@ gistRedoDeleteRecord(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
@@ -397,7 +398,7 @@ gistRedoPageReuse(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
- xlrec->node);
+ xlrec->onCatalogTable, xlrec->node);
}
}
@@ -589,6 +590,7 @@ gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXi
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index d7b7098..00c3e0f 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -1002,7 +1002,8 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
RelFileNode rnode;
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer);
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 5321762..e28465a 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
#include "access/hash.h"
#include "access/hash_xlog.h"
+#include "catalog/catalog.h"
#include "miscadmin.h"
#include "utils/rel.h"
#include "storage/lwlock.h"
@@ -398,6 +399,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
xl_hash_vacuum_one_page xlrec;
XLogRecPtr recptr;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(hrel);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.ntuples = ndeletable;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d768b9b..10b7857 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7149,12 +7149,13 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
* see comments for vacuum_log_cleanup_info().
*/
XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
+log_heap_cleanup_info(Relation rel, TransactionId latestRemovedXid)
{
xl_heap_cleanup_info xlrec;
XLogRecPtr recptr;
- xlrec.node = rnode;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
+ xlrec.node = rel->rd_node;
xlrec.latestRemovedXid = latestRemovedXid;
XLogBeginInsert();
@@ -7190,6 +7191,7 @@ log_heap_clean(Relation reln, Buffer buffer,
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
xlrec.ndead = ndead;
@@ -7240,6 +7242,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.cutoff_xid = cutoff_xid;
xlrec.ntuples = ntuples;
@@ -7270,7 +7273,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
* heap_buffer, if necessary.
*/
XLogRecPtr
-log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
+log_heap_visible(Relation rel, Buffer heap_buffer, Buffer vm_buffer,
TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
@@ -7280,6 +7283,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(heap_buffer));
Assert(BufferIsValid(vm_buffer));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec.cutoff_xid = cutoff_xid;
xlrec.flags = vmflags;
XLogBeginInsert();
@@ -7700,7 +7704,8 @@ heap_xlog_cleanup_info(XLogReaderState *record)
xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, xlrec->node);
/*
* Actual operation is a no-op. Record type exists to provide a means for
@@ -7736,7 +7741,8 @@ heap_xlog_clean(XLogReaderState *record)
* latestRemovedXid is invalid, skip conflict processing.
*/
if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
/*
* If we have a full-page image, restore it (using a cleanup lock) and
@@ -7832,7 +7838,9 @@ heap_xlog_visible(XLogReaderState *record)
* rather than killing the transaction outright.
*/
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid,
+ xlrec->onCatalogTable,
+ rnode);
/*
* Read the heap page, if it still exists. If the heap file has dropped or
@@ -7969,7 +7977,8 @@ heap_xlog_freeze_page(XLogReaderState *record)
TransactionIdRetreat(latestRemovedXid);
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a3c4a1d..bf34d3a 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -473,7 +473,7 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
* No need to write the record at all unless it contains a valid value
*/
if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
- (void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+ (void) log_heap_cleanup_info(rel, vacrelstats->latestRemovedXid);
}
/*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06..c5fdd64 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -281,7 +281,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
if (XLogRecPtrIsInvalid(recptr))
{
Assert(!InRecovery);
- recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
+ recptr = log_heap_visible(rel, heapBuf, vmBuf,
cutoff_xid, flags);
/*
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index de4d4ef..9b1231e 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -31,6 +31,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/lsyscache.h"
#include "utils/snapmgr.h"
static void _bt_cachemetadata(Relation rel, BTMetaPageData *input);
@@ -773,6 +774,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
@@ -1140,6 +1142,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
XLogRecPtr recptr;
xl_btree_delete xlrec_delete;
+ xlrec_delete.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_delete.latestRemovedXid = latestRemovedXid;
xlrec_delete.nitems = nitems;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 6532a25..b874bda 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -526,7 +526,8 @@ btree_xlog_delete(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
/*
@@ -810,6 +811,7 @@ btree_xlog_reuse_page(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable,
xlrec->node);
}
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 2b1662a..eaaf631 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -27,6 +27,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
+#include "utils/lsyscache.h"
/* Entry in pending-list of TIDs we need to revisit */
@@ -502,6 +503,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
OffsetNumber itemnos[MaxIndexTuplesPerPage];
spgxlogVacuumRedirect xlrec;
+ xlrec.onCatalogTable = get_rel_logical_catalog(index->rd_index->indrelid);
xlrec.nToPlaceholder = 0;
xlrec.newestRedirectXid = InvalidTransactionId;
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index ebe6ae8..800609c 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -881,6 +881,7 @@ spgRedoVacuumRedirect(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &node, NULL, NULL);
ResolveRecoveryConflictWithSnapshot(xldata->newestRedirectXid,
+ xldata->onCatalogTable,
node);
}
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e08320e..78d3ad1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4926,6 +4926,15 @@ LocalProcessControlFile(bool reset)
}
/*
+ * Get the wal_level from the control file.
+ */
+int
+ControlFileWalLevel(void)
+{
+ return ControlFile->wal_level;
+}
+
+/*
* Initialization of shared memory for XLOG
*/
Size
@@ -9843,6 +9852,18 @@ xlog_redo(XLogReaderState *record)
/* Update our copy of the parameters in pg_control */
memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
+ /*
+ * Drop logical slots if we are in hot standby and master does not have
+ * logical data. Don't bother to search for the slots if standby is
+ * running with wal_level lower than logical, because in that case,
+ * we would have disallowed creation of logical slots.
+ */
+ if (InRecovery && InHotStandby &&
+ xlrec.wal_level < WAL_LEVEL_LOGICAL &&
+ wal_level >= WAL_LEVEL_LOGICAL)
+ ResolveRecoveryConflictWithSlots(InvalidOid, InvalidTransactionId,
+ gettext_noop("logical decoding on standby requires wal_level >= logical on master"));
+
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->MaxConnections = xlrec.MaxConnections;
ControlFile->max_worker_processes = xlrec.max_worker_processes;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 88be7fe..431a302 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -878,7 +878,6 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
return true;
}
-#ifdef FRONTEND
/*
* Functions that are currently not needed in the backend, but are better
* implemented inside xlogreader.c because of the internal facilities available
@@ -1003,9 +1002,6 @@ out:
return found;
}
-#endif /* FRONTEND */
-
-
/* ----------------------------------------
* Functions for decoding the data and block references in a record.
* ----------------------------------------
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 151c3ef..c1bd028 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
* can restart from there.
*/
break;
+ case XLOG_PARAMETER_CHANGE:
+ {
+ xl_parameter_change *xlrec =
+ (xl_parameter_change *) XLogRecGetData(buf->record);
+
+ /* Cannot proceed if master itself does not have logical data */
+ if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ break;
+ }
case XLOG_NOOP:
case XLOG_NEXTOID:
case XLOG_SWITCH:
case XLOG_BACKUP_END:
- case XLOG_PARAMETER_CHANGE:
case XLOG_RESTORE_POINT:
case XLOG_FPW_CHANGE:
case XLOG_FPI_FOR_HINT:
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index bbd38c0..9f6e0ac 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -94,6 +94,24 @@ CheckLogicalDecodingRequirements(void)
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires a database connection")));
+ if (RecoveryInProgress())
+ {
+ /*
+ * This check may have race conditions, but whenever
+ * XLOG_PARAMETER_CHANGE indicates that wal_level has changed, we
+ * verify that there are no existing logical replication slots. And to
+ * avoid races around creating a new slot,
+ * CheckLogicalDecodingRequirements() is called once before creating
+ * the slot, and once when logical decoding is initially starting up.
+ */
+ if (ControlFileWalLevel() < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ }
+
+#ifdef NOT_ANYMORE
/* ----
* TODO: We got to change that someday soon...
*
@@ -111,6 +129,7 @@ CheckLogicalDecodingRequirements(void)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("logical decoding cannot be used while in recovery")));
+#endif
}
/*
@@ -241,6 +260,8 @@ CreateInitDecodingContext(char *plugin,
LogicalDecodingContext *ctx;
MemoryContext old_context;
+ CheckLogicalDecodingRequirements();
+
/* shorter lines... */
slot = MyReplicationSlot;
@@ -474,6 +495,26 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
(uint32) (slot->data.restart_lsn >> 32),
(uint32) slot->data.restart_lsn);
+ /*
+ * It is not guaranteed that the restart_lsn points to a valid
+ * record location. E.g. on standby, restart_lsn initially points to lastReplayedEndRecPtr,
+ * which is 1 + the end of last replayed record, which means it can point the next
+ * block header start. So bump it to the next valid record.
+ */
+ if (!XRecOffIsValid(startptr))
+ {
+ elog(DEBUG1, "Invalid restart lsn %X/%X",
+ (uint32) (startptr >> 32), (uint32) startptr);
+ startptr = XLogFindNextRecord(ctx->reader, startptr);
+
+ SpinLockAcquire(&slot->mutex);
+ slot->data.restart_lsn = startptr;
+ SpinLockRelease(&slot->mutex);
+
+ elog(DEBUG1, "Moved slot restart lsn to %X/%X",
+ (uint32) (startptr >> 32), (uint32) startptr);
+ }
+
/* Wait for a consistent starting point */
for (;;)
{
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 55c306e..8c8d174 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1016,37 +1016,37 @@ ReplicationSlotReserveWal(void)
/*
* For logical slots log a standby snapshot and start logical decoding
* at exactly that position. That allows the slot to start up more
- * quickly.
+ * quickly. But on a standby we cannot do WAL writes, so just use the
+ * replay pointer; effectively, an attempt to create a logical slot on
+ * standby will cause it to wait for an xl_running_xact record so that
+ * a snapshot can be built using the record.
*
- * That's not needed (or indeed helpful) for physical slots as they'll
- * start replay at the last logged checkpoint anyway. Instead return
- * the location of the last redo LSN. While that slightly increases
- * the chance that we have to retry, it's where a base backup has to
- * start replay at.
+ * None of this is needed (or indeed helpful) for physical slots as
+ * they'll start replay at the last logged checkpoint anyway. Instead
+ * return the location of the last redo LSN. While that slightly
+ * increases the chance that we have to retry, it's where a base backup
+ * has to start replay at.
*/
+
+ restart_lsn =
+ (SlotIsPhysical(slot) ? GetRedoRecPtr() :
+ (RecoveryInProgress() ? GetXLogReplayRecPtr(NULL) :
+ GetXLogInsertRecPtr()));
+
+ SpinLockAcquire(&slot->mutex);
+ slot->data.restart_lsn = restart_lsn;
+ SpinLockRelease(&slot->mutex);
+
if (!RecoveryInProgress() && SlotIsLogical(slot))
{
XLogRecPtr flushptr;
- /* start at current insert position */
- restart_lsn = GetXLogInsertRecPtr();
- SpinLockAcquire(&slot->mutex);
- slot->data.restart_lsn = restart_lsn;
- SpinLockRelease(&slot->mutex);
-
/* make sure we have enough information to start */
flushptr = LogStandbySnapshot();
/* and make sure it's fsynced to disk */
XLogFlush(flushptr);
}
- else
- {
- restart_lsn = GetRedoRecPtr();
- SpinLockAcquire(&slot->mutex);
- slot->data.restart_lsn = restart_lsn;
- SpinLockRelease(&slot->mutex);
- }
/* prevent WAL removal as fast as possible */
ReplicationSlotsComputeRequiredLSN();
@@ -1065,6 +1065,114 @@ ReplicationSlotReserveWal(void)
}
/*
+ * Resolve recovery conflicts with slots.
+ *
+ * When xid is valid, it means it's a removed-xid kind of conflict, so need to
+ * drop the appropriate slots whose xmin conflicts with removed xid.
+ * When xid is invalid, drop all logical slots. This is required when the
+ * master wal_level is set back to replica, so existing logical slots need to
+ * be dropped. Also, when xid is invalid, a common 'reason' is provided for the
+ * error detail; otherwise reason is NULL.
+ */
+void
+ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid, char *reason)
+{
+ int i;
+ bool found_conflict = false;
+
+ if (max_replication_slots <= 0)
+ return;
+
+restart:
+ if (found_conflict)
+ {
+ CHECK_FOR_INTERRUPTS();
+ /*
+ * Wait awhile for them to die so that we avoid flooding an
+ * unresponsive backend when system is heavily loaded.
+ */
+ pg_usleep(100000);
+ found_conflict = false;
+ }
+
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationSlot *s;
+
+ s = &ReplicationSlotCtl->replication_slots[i];
+
+ /* cannot change while ReplicationSlotCtlLock is held */
+ if (!s->in_use)
+ continue;
+
+ /* Invalid xid means caller is asking to drop all logical slots */
+ if (!TransactionIdIsValid(xid) && SlotIsLogical(s))
+ found_conflict = true;
+ else
+ {
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;
+ StringInfoData conflict_str;
+
+ /* not our database, skip */
+ if (s->data.database != InvalidOid && s->data.database != dboid)
+ continue;
+
+ SpinLockAcquire(&s->mutex);
+ slot_xmin = s->data.xmin;
+ slot_catalog_xmin = s->data.catalog_xmin;
+ SpinLockRelease(&s->mutex);
+
+ /*
+ * Build the conflict_str which will look like :
+ * "slot xmin: 1234, catalog_xmin: 5678, removed xid : 9012"
+ */
+ initStringInfo(&conflict_str);
+ if (TransactionIdIsValid(slot_xmin) &&
+ TransactionIdPrecedesOrEquals(slot_xmin, xid))
+ appendStringInfo(&conflict_str, "slot xmin: %d", slot_xmin);
+
+ if (TransactionIdIsValid(slot_catalog_xmin) &&
+ TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
+ appendStringInfo(&conflict_str, "%sslot catalog_xmin: %d",
+ conflict_str.len > 0 ? ", " : "",
+ slot_catalog_xmin);
+
+ if (conflict_str.len > 0)
+ {
+ appendStringInfo(&conflict_str, ", %s xid : %d",
+ gettext_noop("removed"), xid);
+ found_conflict = true;
+ reason = conflict_str.data;
+ }
+ }
+
+ if (found_conflict)
+ {
+ NameData slotname;
+
+ SpinLockAcquire(&s->mutex);
+ slotname = s->data.name;
+ SpinLockRelease(&s->mutex);
+
+ ereport(LOG,
+ (errmsg("Dropping conflicting slot %s", NameStr(slotname)),
+ errdetail("%s", reason)));
+
+ LWLockRelease(ReplicationSlotControlLock); /* avoid deadlock */
+ ReplicationSlotDropPtr(s);
+
+ /* We released the lock above; so re-scan the slots. */
+ goto restart;
+ }
+ }
+
+ LWLockRelease(ReplicationSlotControlLock);
+}
+
+
+/*
* Flush all replication slots to disk.
*
* This needn't actually be part of a checkpoint, but it's a convenient
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 92fa86f..4ce7096 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2814,6 +2814,7 @@ XLogSendLogical(void)
{
XLogRecord *record;
char *errm;
+ XLogRecPtr flushPtr;
/*
* Don't know whether we've caught up yet. We'll set WalSndCaughtUp to
@@ -2830,10 +2831,11 @@ XLogSendLogical(void)
if (errm != NULL)
elog(ERROR, "%s", errm);
+ flushPtr = (am_cascading_walsender ?
+ GetStandbyFlushRecPtr() : GetFlushRecPtr());
+
if (record != NULL)
{
- /* XXX: Note that logical decoding cannot be used while in recovery */
- XLogRecPtr flushPtr = GetFlushRecPtr();
/*
* Note the lack of any call to LagTrackerWrite() which is handled by
@@ -2857,7 +2859,7 @@ XLogSendLogical(void)
* If the record we just wanted read is at or beyond the flushed
* point, then we're caught up.
*/
- if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
+ if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
{
WalSndCaughtUp = true;
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 25b7e31..a45345c 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -23,6 +23,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/slot.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -291,7 +292,8 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
}
void
-ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode node)
+ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
+ bool onCatalogTable, RelFileNode node)
{
VirtualTransactionId *backends;
@@ -312,6 +314,9 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode
ResolveRecoveryConflictWithVirtualXIDs(backends,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+
+ if (onCatalogTable)
+ ResolveRecoveryConflictWithSlots(node.dbNode, latestRemovedXid, NULL);
}
void
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index c13c08a..bd35bc1 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -18,7 +18,9 @@
#include "access/hash.h"
#include "access/htup_details.h"
#include "access/nbtree.h"
+#include "access/table.h"
#include "bootstrap/bootstrap.h"
+#include "catalog/catalog.h"
#include "catalog/namespace.h"
#include "catalog/pg_am.h"
#include "catalog/pg_amop.h"
@@ -1893,6 +1895,20 @@ get_rel_persistence(Oid relid)
return result;
}
+bool
+get_rel_logical_catalog(Oid relid)
+{
+ bool res;
+ Relation rel;
+
+ /* assume previously locked */
+ rel = heap_open(relid, NoLock);
+ res = RelationIsAccessibleInLogicalDecoding(rel);
+ heap_close(rel, NoLock);
+
+ return res;
+}
+
/* ---------- TRANSFORM CACHE ---------- */
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 969a537..59246c3 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -48,9 +48,9 @@ typedef struct gistxlogPageUpdate
*/
typedef struct gistxlogDelete
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 ntodelete; /* number of deleted offsets */
-
/*
* In payload of blk 0 : todelete OffsetNumbers
*/
@@ -96,6 +96,7 @@ typedef struct gistxlogPageDelete
*/
typedef struct gistxlogPageReuse
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 53b682c..fd70b55 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -263,6 +263,7 @@ typedef struct xl_hash_init_bitmap_page
*/
typedef struct xl_hash_vacuum_one_page
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int ntuples;
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index f6cdca8..a1d1f11 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -237,6 +237,7 @@ typedef struct xl_heap_update
*/
typedef struct xl_heap_clean
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 nredirected;
uint16 ndead;
@@ -252,6 +253,7 @@ typedef struct xl_heap_clean
*/
typedef struct xl_heap_cleanup_info
{
+ bool onCatalogTable;
RelFileNode node;
TransactionId latestRemovedXid;
} xl_heap_cleanup_info;
@@ -332,6 +334,7 @@ typedef struct xl_heap_freeze_tuple
*/
typedef struct xl_heap_freeze_page
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint16 ntuples;
} xl_heap_freeze_page;
@@ -346,6 +349,7 @@ typedef struct xl_heap_freeze_page
*/
typedef struct xl_heap_visible
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint8 flags;
} xl_heap_visible;
@@ -395,7 +399,7 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
extern const char *heap2_identify(uint8 info);
extern void heap_xlog_logical_rewrite(XLogReaderState *r);
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
+extern XLogRecPtr log_heap_cleanup_info(Relation rel,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
@@ -414,7 +418,7 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
bool *totally_frozen);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
-extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
+extern XLogRecPtr log_heap_visible(Relation rel, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc8..f64a33c 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -126,6 +126,7 @@ typedef struct xl_btree_split
*/
typedef struct xl_btree_delete
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int nitems;
@@ -139,6 +140,7 @@ typedef struct xl_btree_delete
*/
typedef struct xl_btree_reuse_page
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/spgxlog.h b/src/include/access/spgxlog.h
index 073f740..d3dad69 100644
--- a/src/include/access/spgxlog.h
+++ b/src/include/access/spgxlog.h
@@ -237,6 +237,7 @@ typedef struct spgxlogVacuumRoot
typedef struct spgxlogVacuumRedirect
{
+ bool onCatalogTable;
uint16 nToPlaceholder; /* number of redirects to make placeholders */
OffsetNumber firstPlaceholder; /* first placeholder tuple to remove */
TransactionId newestRedirectXid; /* newest XID of removed redirects */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 237f4e0..fa02728 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -299,6 +299,7 @@ extern Size XLOGShmemSize(void);
extern void XLOGShmemInit(void);
extern void BootStrapXLOG(void);
extern void LocalProcessControlFile(bool reset);
+extern int ControlFileWalLevel(void);
extern void StartupXLOG(void);
extern void ShutdownXLOG(int code, Datum arg);
extern void InitXLOGAccess(void);
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 04228e2..a5ffffc 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -215,9 +215,7 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
/* Invalidate read state */
extern void XLogReaderInvalReadState(XLogReaderState *state);
-#ifdef FRONTEND
extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
-#endif /* FRONTEND */
/* Functions for decoding an XLogRecord */
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 8fbddea..3a90aac 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -205,4 +205,6 @@ extern void CheckPointReplicationSlots(void);
extern void CheckSlotRequirements(void);
+extern void ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid, char *reason);
+
#endif /* SLOT_H */
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index a3f8f82..6dedebc 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -28,7 +28,7 @@ extern void InitRecoveryTransactionEnvironment(void);
extern void ShutdownRecoveryTransactionEnvironment(void);
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
- RelFileNode node);
+ bool onCatalogTable, RelFileNode node);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index c8df5bf..579d9ff 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -131,6 +131,7 @@ extern char get_rel_relkind(Oid relid);
extern bool get_rel_relispartition(Oid relid);
extern Oid get_rel_tablespace(Oid relid);
extern char get_rel_persistence(Oid relid);
+extern bool get_rel_logical_catalog(Oid relid);
extern Oid get_transform_fromsql(Oid typid, Oid langid, List *trftypes);
extern Oid get_transform_tosql(Oid typid, Oid langid, List *trftypes);
extern bool get_typisdefined(Oid typid);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d7f33ab..8c90fd7 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -16,6 +16,7 @@
#include "access/tupdesc.h"
#include "access/xlog.h"
+#include "catalog/catalog.h"
#include "catalog/pg_class.h"
#include "catalog/pg_index.h"
#include "catalog/pg_publication.h"
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 8d5ad6b..a9a1ac7 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -2009,6 +2009,33 @@ sub pg_recvlogical_upto
=pod
+=item $node->create_logical_slot_on_standby(self, master, slot_name, dbname)
+
+Create logical replication slot on given standby
+
+=cut
+
+sub create_logical_slot_on_standby
+{
+ my ($self, $master, $slot_name, $dbname) = @_;
+ my ($stdout, $stderr);
+
+ my $handle;
+
+ $handle = IPC::Run::start(['pg_recvlogical', '-d', $self->connstr($dbname), '-P', 'test_decoding', '-S', $slot_name, '--create-slot'], '>', \$stdout, '2>', \$stderr);
+ sleep(1);
+
+ # Slot creation on standby waits for an xl_running_xacts record. So arrange
+ # for it.
+ $master->safe_psql('postgres', 'CHECKPOINT');
+
+ $handle->finish();
+
+ return 0;
+}
+
+=pod
+
=back
=cut
diff --git a/src/test/recovery/t/018_logical_decoding_on_replica.pl b/src/test/recovery/t/018_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..304f32a
--- /dev/null
+++ b/src/test/recovery/t/018_logical_decoding_on_replica.pl
@@ -0,0 +1,395 @@
+# Test logical decoding on a standby.
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 51;
+use RecursiveCopy;
+use File::Copy;
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--slot=decoding_standby');
+
+# Fetch xmin columns from slot's pg_replication_slots row, after waiting for
+# given boolean condition to be true to ensure we've reached a quiescent state
+sub wait_for_phys_mins
+{
+ my ($node, $slotname, $check_expr) = @_;
+
+ $node->poll_query_until(
+ 'postgres', qq[
+ SELECT $check_expr
+ FROM pg_catalog.pg_replication_slots
+ WHERE slot_name = '$slotname';
+ ]) or die "Timed out waiting for slot xmins to advance";
+
+ my $slotinfo = $node->slot($slotname);
+ return ($slotinfo->{'xmin'}, $slotinfo->{'catalog_xmin'});
+}
+
+sub print_phys_xmin
+{
+ my $slot = $node_master->slot('decoding_standby');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# After slot creation, xmins must be null
+is($xmin, '', "xmin null");
+is($catalog_xmin, '', "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+ $node_master, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_replica->append_conf('postgresql.conf',
+ q[primary_slot_name = 'decoding_standby']);
+
+$node_replica->start;
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# with hot_standby_feedback off, xmin and catalog_xmin must still be null
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "xmin null after replica join");
+is($catalog_xmin, '', "catalog_xmin null after replica join");
+
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. With hot_standby_feedback on, xmin should advance,
+# but catalog_xmin should still remain NULL since there is no logical slot.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NULL");
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+#
+# This must succeed since we know we have a catalog_xmin reservation. We
+# might've already sent hot standby feedback to advance our physical slot's
+# catalog_xmin but not received the corresponding xlog for the catalog xmin
+# advance, in which case we'll create a slot that isn't usable. The calling
+# application can prevent this by creating a temporary slot on the master to
+# lock in its catalog_xmin. For a truly race-free solution we'd need
+# master-to-standby hot_standby_feedback replies.
+#
+# In this case it won't race because there's no concurrent activity on the
+# master.
+#
+is($node_replica->create_logical_slot_on_standby($node_master, 'standby_logical', 'testdb'),
+ 0, 'logical slot creation on standby succeeded')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+sub print_logical_xmin
+{
+ my $slot = $node_replica->slot('standby_logical');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
+$node_master->safe_psql('testdb', 'DROP TABLE test_table');
+$node_master->safe_psql('testdb', 'VACUUM');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+# Should show the inserts even when the table is dropped on master
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($stderr, '', 'stderr is empty');
+is($ret, 0, 'replay from slot succeeded')
+ or BAIL_OUT('cannot continue if slot replay fails');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_1();]);
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+for my $i (0 .. 2000)
+{
+ $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_2();]);
+$node_master->safe_psql('testdb', 'VACUUM');
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+cmp_ok($new_logical_catalog_xmin, "==", $logical_catalog_xmin, "logical slot catalog_xmin hasn't advanced before get_changes");
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+isnt($stdout, '', 'replayed some rows');
+
+($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+is($new_logical_xmin, '', "logical xmin null");
+isnt($new_logical_catalog_xmin, '', "logical slot catalog_xmin not null");
+cmp_ok($new_logical_catalog_xmin, ">", $logical_catalog_xmin, "logical slot catalog_xmin advanced after get_changes");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+isnt($new_physical_xmin, '', "physical xmin not null");
+# hot standby feedback should advance phys catalog_xmin now that the standby's
+# slot doesn't hold it down as far.
+isnt($new_physical_catalog_xmin, '', "physical catalog_xmin not null");
+cmp_ok($new_physical_catalog_xmin, ">", $physical_catalog_xmin, "physical catalog_xmin advanced");
+
+cmp_ok($new_physical_catalog_xmin, "<=", $new_logical_catalog_xmin, 'upstream physical slot catalog_xmin not past downstream catalog_xmin with hs_feedback on');
+
+#########################################################
+# Upstream oldestXid retention
+#########################################################
+
+sub test_oldest_xid_retention()
+{
+ # First burn some xids on the master in another DB, so we push the master's
+ # nextXid ahead.
+ foreach my $i (1 .. 100)
+ {
+ $node_master->safe_psql('postgres', 'SELECT txid_current()');
+ }
+
+ # Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
+ # past our needed xmin. The only way we have visibility into that is to force
+ # a checkpoint.
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
+ foreach my $dbname ('template1', 'postgres', 'testdb', 'template0')
+ {
+ $node_master->safe_psql($dbname, 'VACUUM FREEZE');
+ }
+ sleep(1);
+ $node_master->safe_psql('postgres', 'CHECKPOINT');
+ IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
+ or die "pg_controldata failed with $?";
+ my @checkpoint = split('\n', $stdout);
+ my ($oldestXid, $nextXid) = ('', '');
+ foreach my $line (@checkpoint)
+ {
+ if ($line =~ qr/^Latest checkpoint's NextXID:\s+\d+:(\d+)/)
+ {
+ $nextXid = $1;
+ }
+ if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
+ {
+ $oldestXid = $1;
+ }
+ }
+ die 'no oldestXID found in checkpoint' unless $oldestXid;
+
+ my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+ my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+
+ print "upstream oldestXid $oldestXid, nextXid $nextXid, phys slot catalog_xmin $new_physical_catalog_xmin, downstream catalog_xmin $new_logical_catalog_xmin";
+
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
+
+ return ($oldestXid);
+}
+
+my ($oldestXid) = test_oldest_xid_retention();
+
+cmp_ok($oldestXid, "<=", $new_logical_catalog_xmin, 'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
+
+########################################################################
+# Recovery conflict: conflicting replication slot should get dropped
+########################################################################
+
+# One way to reproduce recovery conflict is to run VACUUM FULL with
+# hot_standby_feedback turned off on slave.
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = off
+]);
+$node_replica->restart;
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. Both should be NULL since hs_feedback is off
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NULL AND catalog_xmin IS NULL");
+$node_master->safe_psql('testdb', 'VACUUM FULL');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+isnt($ret, 0, 'usage of slot failed as expected');
+like($stderr, qr/does not exist/, 'slot not found as expected');
+
+# Re-create the slot now that we know it is dropped
+is($node_replica->create_logical_slot_on_standby($node_master, 'standby_logical', 'testdb'),
+ 0, 'logical slot creation on standby succeeded')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+# Set hot_standby_feedback back on
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. Both should be non-NULL since hs_feedback is on and
+# there is a logical slot present on standby.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NOT NULL");
+
+##################################################
+# Drop slot
+##################################################
+#
+is($node_replica->safe_psql('postgres', 'SHOW hot_standby_feedback'), 'on', 'hs_feedback is on');
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+
+# Make sure slots on replicas are droppable, and properly clear the upstream's xmin
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
+
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. catalog_xmin should become NULL because we dropped
+# the logical slot.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NULL");
+
+##################################################
+# Recovery: drop database drops idle slots
+##################################################
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB on the upstream if they're on the right DB, or not dropped if on
+# another DB.
+
+is($node_replica->create_logical_slot_on_standby($node_master, 'dodropslot', 'testdb'),
+ 0, 'created dodropslot on testdb')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+is($node_replica->create_logical_slot_on_standby($node_master, 'otherslot', 'postgres'),
+ 0, 'created otherslot on postgres')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, 'logical', 'slot dodropslot on standby created');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'slot otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+ 'database dropped on standby');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, '', 'slot on standby dropped');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby not dropped');
+
+
+##################################################
+# Recovery: drop database drops in-use slots
+##################################################
+
+# This time, have the slot in-use on the downstream DB when we drop it.
+print "Testing dropdb when downstream slot is in-use";
+$node_master->psql('postgres', q[CREATE DATABASE testdb2]);
+
+print "creating slot dodropslot2";
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-P', 'test_decoding', '-S', 'dodropslot2', '--create-slot'],
+ 'pg_recvlogical created slot test_decoding');
+is($node_replica->slot('dodropslot2')->{'slot_type'}, 'logical', 'slot dodropslot2 on standby created');
+
+# make sure the slot is in use
+print "starting pg_recvlogical";
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+sleep(1);
+
+is($node_replica->slot('dodropslot2')->{'active'}, 't', 'slot on standby is active')
+ or BAIL_OUT("slot not active on standby, cannot continue. pg_recvlogical exited with '$stdout', '$stderr'");
+
+# Master doesn't know the replica's slot is busy so dropdb should succeed
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb2]);
+ok(1, 'dropdb finished');
+
+while ($node_replica->slot('dodropslot2')->{'active_pid'})
+{
+ sleep(1);
+ print "waiting for walsender to exit";
+}
+
+print "walsender exited, waiting for pg_recvlogical to exit";
+
+# our client should've terminated in response to the walsender error
+eval {
+ $handle->finish;
+};
+$return = $?;
+if ($return) {
+ is($return, 256, "pg_recvlogical terminated by server");
+ like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict');
+ like($stderr, qr/User was connected to a database that must be dropped./, 'recvlogical recovery conflict db');
+}
+
+is($node_replica->slot('dodropslot2')->{'active_pid'}, '', 'walsender backend exited');
+
+# The slot should be dropped by recovery now
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb2')]), 'f',
+ 'database dropped on standby');
+
+is($node_replica->slot('dodropslot2')->{'slot_type'}, '', 'slot on standby dropped');
--
2.1.4
On Thu, 20 Jun 2019 at 00:31, Andres Freund <andres@anarazel.de> wrote:
On 2019-06-12 17:30:02 +0530, Amit Khandekar wrote:
In the attached v6 version of the patch, I did the above. That is, I
used XLogFindNextRecord() to bump up the restart_lsn of the slot to
the first valid record. But since XLogReaderState is not available in
ReplicationSlotReserveWal(), I did this in
DecodingContextFindStartpoint(). And then updated the slot restart_lsn
with this corrected position. Since XLogFindNextRecord() is currently
disabled using #if 0, removed this directive.

Well, ifdef FRONTEND. I don't think that's a problem. It's a bit
overkill here, because I think we know the address has to be on a record
boundary (rather than being in the middle of a page spanning WAL
record). So we could just add the size of the header manually
- but I think that's not worth doing.

Or else, do you think we can just increment the record pointer by
doing something like (lastReplayedEndRecPtr % XLOG_BLCKSZ) +
SizeOfXLogShortPHD() ?

I found out that we can't do this, because we don't know whether the
xlog header is SizeOfXLogShortPHD or SizeOfXLogLongPHD. In fact, in
our context, it is SizeOfXLogLongPHD. So we indeed need the
XLogReaderState handle.

Well, we can determine whether a long or a short header is going to be
used, as that's solely dependent on the LSN:

    /*
     * If first page of an XLOG segment file, make it a long header.
     */
    if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
    {
        XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;

        NewLongPage->xlp_sysid = ControlFile->system_identifier;
        NewLongPage->xlp_seg_size = wal_segment_size;
        NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
        NewPage->xlp_info |= XLP_LONG_HEADER;
    }

but I don't think that's worth it.
Ok, so what you are saying is: in case of ReplayRecPtr, it is always
possible to know whether it is pointing at a long header or a short
header, just by looking at its value, and then we just increment it by
that header size. Why do you think it is not worth it ? In fact, I
thought we *have* to increment it to set it to the next record; I
didn't understand what other option we have.
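
Just to spell out the manual bump being discussed, a rough sketch (only an
illustration of the idea, not code from the patch; the helper name
skip_page_header is mine, and it assumes wal_segment_size and the
page-header macros from access/xlog_internal.h are in scope):

    /*
     * Sketch only: if the LSN sits exactly on a page boundary, skip the
     * page header so it points at the first record on that page. A long
     * header is used only on the first page of a segment.
     */
    static XLogRecPtr
    skip_page_header(XLogRecPtr lsn)
    {
        if (lsn % XLOG_BLCKSZ == 0)
        {
            if (XLogSegmentOffset(lsn, wal_segment_size) == 0)
                lsn += SizeOfXLogLongPHD;   /* first page of a segment */
            else
                lsn += SizeOfXLogShortPHD;  /* any other page */
        }
        return lsn;
    }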
Do you think that we can solve this using some other approach ? I am
not sure whether it's only the initial conditions that cause
lastReplayedEndRecPtr value to *not* point to a valid record, or is it
just a coincidence, and lastReplayedEndRecPtr can also have such a
value any time afterwards?

It's always possible. All that means is that the last record filled the
entire last WAL page.
Ok that means we *have* to bump the pointer ahead.
If it's only possible initially, we can
just use GetRedoRecPtr() instead of lastReplayedEndRecPtr if
lastReplayedEndRecPtr is invalid.

I don't think so? The redo pointer will point to something *much*
earlier, where we'll not yet have done all the necessary conflict
handling during recovery? So we'd not necessarily notice that a slot
is not actually usable for decoding.

We could instead just handle that by starting decoding at the redo
pointer, and just ignore all WAL records until they're after
lastReplayedEndRecPtr, but that has no advantages, and will read a lot
more WAL.
Yeah I agree : just doing this for initial case is a bad idea.
static void _bt_cachemetadata(Relation rel, BTMetaPageData *input);
@@ -773,6 +774,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
 xlrec_reuse.node = rel->rd_node;
 xlrec_reuse.block = blkno;
 xlrec_reuse.latestRemovedXid = latestRemovedXid;

@@ -1140,6 +1142,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 XLogRecPtr recptr;
 xl_btree_delete xlrec_delete;

+ xlrec_delete.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
 xlrec_delete.latestRemovedXid = latestRemovedXid;
 xlrec_delete.nitems = nitems;

Can we instead pass the heap rel down to here? I think there's only one
caller, and it has the heap relation available these days (it didn't at
the time of the prototype, possibly). There's a few other users of
get_rel_logical_catalog() where that might be harder, but it's easy
here.
For _bt_log_reuse_page(), its only caller is _bt_getbuf(), which does
not have heapRel parameter. Let me know which caller you were
referring to that has heapRel.
For _bt_delitems_delete(), the function itself already has a heapRel
parameter, so I will use that for onCatalogTable.
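
Concretely, something like this is what I have in mind for
_bt_delitems_delete() (a sketch, not the final change; it relies on the
heapRel argument that function already takes, instead of the
get_rel_logical_catalog() lookup):

    /* sketch: derive the flag from the heap relation we already have */
    xlrec_delete.onCatalogTable = RelationIsAccessibleInLogicalDecoding(heapRel);
    xlrec_delete.latestRemovedXid = latestRemovedXid;
    xlrec_delete.nitems = nitems;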
@@ -27,6 +27,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
+#include "utils/lsyscache.h"/* Entry in pending-list of TIDs we need to revisit */
@@ -502,6 +503,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
OffsetNumber itemnos[MaxIndexTuplesPerPage];
spgxlogVacuumRedirect xlrec;+ xlrec.onCatalogTable = get_rel_logical_catalog(index->rd_index->indrelid);
xlrec.nToPlaceholder = 0;
xlrec.newestRedirectXid = InvalidTransactionId;This one seems harder, but I'm not actually sure why we make it so
hard. It seems like we just ought to add the table to IndexVacuumInfo.
This means we have to add heapRel assignment wherever we initialize
IndexVacuumInfo structure, namely in lazy_vacuum_index(),
lazy_cleanup_index(), validate_index(), analyze_rel(), and make sure
these functions have a heap rel handle. Do you think we should do this
as part of this patch ?
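
If we do go that way, the struct change itself would presumably be small; a
sketch only (the field name heaprel is mine, the existing fields of
IndexVacuumInfo in access/genam.h are elided):

    typedef struct IndexVacuumInfo
    {
        Relation    index;      /* the index being vacuumed */
        Relation    heaprel;    /* (sketch) heap relation the index is on */
        /* ... existing fields unchanged ... */
    } IndexVacuumInfo;

vacuumRedirectAndPlaceholder() would then need the heap relation (or the
whole IndexVacuumInfo) passed down, and could use
RelationIsAccessibleInLogicalDecoding(info->heaprel) instead of the
syscache lookup.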
+ if (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(LOG,
+ (errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_catalog_xmin, xid)));
+ }
+
+ }
+ if (found_conflict)
+ {
The above changes seem to be from the older version (v6) of the patch.
Just wanted to make sure you are using v8 patch.
Hm, as far as I can tell you just ignore that the slot might currently
be in use. You can't just drop a slot that somebody is using. I think
you need to send a recovery conflict to that backend.
Yeah, I am currently working on this. As you suggested, I am going to
call CancelVirtualTransaction() and for its sigmode parameter, I will
pass a new ProcSignalReason value PROCSIG_RECOVERY_CONFLICT_SLOT.
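
So the cancellation itself would boil down to something like this (sketch;
PROCSIG_RECOVERY_CONFLICT_SLOT is the new ProcSignalReason value, it does
not exist yet):

    if (VirtualTransactionIdIsValid(vxid))
        CancelVirtualTransaction(vxid, PROCSIG_RECOVERY_CONFLICT_SLOT);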
+ elog(LOG, "Dropping conflicting slot %s", s->data.name.data);
This definitely needs to be expanded, and follow the message style
guideline.
This message, with the v8 patch, looks like this :
ereport(LOG,
(errmsg("Dropping conflicting slot %s", NameStr(slotname)),
errdetail("%s", reason)));
where reason is a char string.
On Thu, 20 Jun 2019 at 00:31, Andres Freund <andres@anarazel.de> wrote:
Or else, do you think we can just increment the record pointer by
doing something like (lastReplayedEndRecPtr % XLOG_BLCKSZ) +
SizeOfXLogShortPHD() ?

I found out that we can't do this, because we don't know whether the
xlog header is SizeOfXLogShortPHD or SizeOfXLogLongPHD. In fact, in
our context, it is SizeOfXLogLongPHD. So we indeed need the
XLogReaderState handle.

Well, we can determine whether a long or a short header is going to be
used, as that's solely dependent on the LSN:
Discussion of this point (plus some more points) is in a separate
reply. You can reply to my comments there :
/messages/by-id/CAJ3gD9f_HjQ6qP=+1jwzwy77fwcbT4-M3UvVsqpAzsY-jqM8nw@mail.gmail.com
/*
+ * Get the wal_level from the control file.
+ */
+int
+ControlFileWalLevel(void)
+{
+ return ControlFile->wal_level;
+}

Any reason not to return the type enum WalLevel instead? I'm not sure I
like the function name - perhaps something like GetActiveWalLevel() or
such? The fact that it's in the control file doesn't seem relevant
here. I think it should be close to DataChecksumsEnabled() etc, which
all return information from the control file.
Done.
+/*
* Initialization of shared memory for XLOG
*/
Size
@@ -9843,6 +9852,17 @@ xlog_redo(XLogReaderState *record)
/* Update our copy of the parameters in pg_control */
memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));

+ /*
+ * Drop logical slots if we are in hot standby and master does not have
+ * logical data. Don't bother to search for the slots if standby is
+ * running with wal_level lower than logical, because in that case,
+ * we would have disallowed creation of logical slots.
+ */

s/disallowed creation/disallowed creation or previously dropped/
Did this :
* we would have either disallowed creation of logical slots or dropped
* existing ones.
+ if (InRecovery && InHotStandby &&
+ xlrec.wal_level < WAL_LEVEL_LOGICAL &&
+ wal_level >= WAL_LEVEL_LOGICAL)
+ ResolveRecoveryConflictWithSlots(InvalidOid, InvalidTransactionId);
+
 LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 ControlFile->MaxConnections = xlrec.MaxConnections;
 ControlFile->max_worker_processes = xlrec.max_worker_processes;

Not for this patch, but I kinda feel the individual replay routines
ought to be broken out of xlog_redo().
Yeah, agree.
/* ----------------------------------------
 * Functions for decoding the data and block references in a record.
 * ----------------------------------------

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 151c3ef..c1bd028 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 * can restart from there.
 */
 break;
+ case XLOG_PARAMETER_CHANGE:
+ {
+ xl_parameter_change *xlrec =
+ (xl_parameter_change *) XLogRecGetData(buf->record);
+
+ /* Cannot proceed if master itself does not have logical data */
+ if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ break;
+ }

This should also HINT to drop the replication slot.
In this case, DecodeXLogOp() is being called because somebody is using
the slot itself. Not sure if it makes sense to hint the user to drop
the very slot that he/she is using. A hint to drop the slot would make
more sense if the user were doing something that does not require a
slot and the slot had merely become a nuisance; here, dropping it is
not really a fix. What do you say ? Arguably the error message itself
already hints at the remedy: setting wal_level back to logical on the
master.
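
If we did decide to add a hint there, it would presumably be something
along these lines (sketch only, wording not settled):

    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("logical decoding on standby requires "
                    "wal_level >= logical on master"),
             errhint("Either set wal_level back to logical on the master, "
                     "or drop the logical replication slot.")));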
+ /*
+ * It is not guaranteed that the restart_lsn points to a valid
+ * record location. E.g. on standby, restart_lsn initially points to lastReplayedEndRecPtr,
+ * which is 1 + the end of last replayed record, which means it can point the next
+ * block header start. So bump it to the next valid record.
+ */

I'd rephrase this as something like:
restart_lsn initially may point one past the end of the record. If that
is a XLOG page boundary, it will not be a valid LSN for the start of a
record. If that's the case, look for the start of the first record.
Done.
+ if (!XRecOffIsValid(startptr))
+ {

Hm, could you before this add an Assert(startptr != InvalidXLogRecPtr)
or such?
Yeah, done
+ elog(DEBUG1, "Invalid restart lsn %X/%X", + (uint32) (startptr >> 32), (uint32) startptr); + startptr = XLogFindNextRecord(ctx->reader, startptr); + + SpinLockAcquire(&slot->mutex); + slot->data.restart_lsn = startptr; + SpinLockRelease(&slot->mutex); + elog(DEBUG1, "Moved slot restart lsn to %X/%X", + (uint32) (startptr >> 32), (uint32) startptr); + }Minor nit: normally debug messages don't start with upper case.
Done.
/* Wait for a consistent starting point */
 for (;;)
 {

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 55c306e..7ffd264 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1016,37 +1016,37 @@ ReplicationSlotReserveWal(void)
 /*
 * For logical slots log a standby snapshot and start logical decoding
 * at exactly that position. That allows the slot to start up more
- * quickly.
+ * quickly. But on a standby we cannot do WAL writes, so just use the
+ * replay pointer; effectively, an attempt to create a logical slot on
+ * standby will cause it to wait for an xl_running_xact record so that
+ * a snapshot can be built using the record.

I'd add "to be logged independently on the primary" after "wait for an
xl_running_xact record".
Done.
- * That's not needed (or indeed helpful) for physical slots as they'll
- * start replay at the last logged checkpoint anyway. Instead return
- * the location of the last redo LSN. While that slightly increases
- * the chance that we have to retry, it's where a base backup has to
- * start replay at.
+ * None of this is needed (or indeed helpful) for physical slots as
+ * they'll start replay at the last logged checkpoint anyway. Instead
+ * return the location of the last redo LSN. While that slightly
+ * increases the chance that we have to retry, it's where a base backup
+ * has to start replay at.
 */
+
+ restart_lsn =
+ (SlotIsPhysical(slot) ? GetRedoRecPtr() :
+ (RecoveryInProgress() ? GetXLogReplayRecPtr(NULL) :
+ GetXLogInsertRecPtr()));

Please rewrite this to use normal if blocks. I'm also not convinced that
it's useful to have this if block, and then another if block that
basically tests the same conditions again.
Will check and get back on this one.
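
For reference, the plain if/else form being asked for would presumably look
like this (a sketch, nothing more):

    if (SlotIsPhysical(slot))
        restart_lsn = GetRedoRecPtr();
    else if (RecoveryInProgress())
        restart_lsn = GetXLogReplayRecPtr(NULL);
    else
        restart_lsn = GetXLogInsertRecPtr();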
/*
+ * Resolve recovery conflicts with slots.
+ *
+ * When xid is valid, it means it's a removed-xid kind of conflict, so need to
+ * drop the appropriate slots whose xmin conflicts with removed xid.

I don't think "removed-xid kind of conflict" is that descriptive. I'd
suggest something like "When xid is valid, it means that rows older than
xid might have been removed. Therefore we need to drop slots that depend
on seeing those rows."
Done.
+ * When xid is invalid, drop all logical slots. This is required when the
+ * master wal_level is set back to replica, so existing logical slots need to
+ * be dropped.
+ */
+void
+ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
+{
+ int i;
+ bool found_conflict = false;
+
+ if (max_replication_slots <= 0)
+ return;
+
+restart:
+ if (found_conflict)
+ {
+ CHECK_FOR_INTERRUPTS();
+ /*
+ * Wait awhile for them to die so that we avoid flooding an
+ * unresponsive backend when system is heavily loaded.
+ */
+ pg_usleep(100000);
+ found_conflict = false;
+ }

Hm, I wonder if we could use the condition variable the slot
infrastructure has these days for this instead.
Removed the pg_usleep; in the attached patch, we now sleep on the
condition variable just after the recovery conflict signal is sent.
Details down below.
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationSlot *s;
+ NameData slotname;
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;
+
+ s = &ReplicationSlotCtl->replication_slots[i];
+
+ /* cannot change while ReplicationSlotCtlLock is held */
+ if (!s->in_use)
+ continue;
+
+ /* Invalid xid means caller is asking to drop all logical slots */
+ if (!TransactionIdIsValid(xid) && SlotIsLogical(s))
+ found_conflict = true;

I'd just add

if (!SlotIsLogical(s))
    continue;

because all of this doesn't need to happen for slots that aren't
logical.
Yeah right. Done. Also renamed the function to
ResolveRecoveryConflictWithLogicalSlots() to emphasize that it is only
for logical slots.
+ else
+ {
+ /* not our database, skip */
+ if (s->data.database != InvalidOid && s->data.database != dboid)
+ continue;
+
+ SpinLockAcquire(&s->mutex);
+ slotname = s->data.name;
+ slot_xmin = s->data.xmin;
+ slot_catalog_xmin = s->data.catalog_xmin;
+ SpinLockRelease(&s->mutex);
+
+ if (TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(LOG,
+ (errmsg("slot %s w/ xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_xmin, xid)));
+ }

s/removed xid/xid horizon being increased to %u/
BTW, this message belongs to an older version of the patch; check v7
onwards for the latest way the message is generated. Anyway, I have
used the above suggestion. Now the message detail will look like :
"slot xmin: 1234, slot catalog_xmin: 5678, conflicts with xid horizon
being increased to 9012"
+ if (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
+ {
+ found_conflict = true;
+
+ ereport(LOG,
+ (errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
+ NameStr(slotname), slot_catalog_xmin, xid)));
+ }
+
+ }
+ if (found_conflict)
+ {

Hm, as far as I can tell you just ignore that the slot might currently
be in use. You can't just drop a slot that somebody is using.
Yeah, I missed that.
I think
you need to send a recovery conflict to that backend.

I guess the easiest way to do that would be something roughly like:

SetInvalidVirtualTransactionId(vxid);
LWLockAcquire(ProcArrayLock, LW_SHARED);
cancel_proc = BackendPidGetProcWithLock(active_pid);
if (cancel_proc)
vxid = GET_VXID_FROM_PGPROC(cancel_proc);
LWLockRelease(ProcArrayLock);

if (VirtualTransactionIdIsValid(vxid))
{
CancelVirtualTransaction(vxid);

/* Wait here until we get signaled, and then restart */
ConditionVariableSleep(&slot->active_cv,
WAIT_EVENT_REPLICATION_SLOT_DROP);
}
ConditionVariableCancelSleep();

when the slot is currently active.
Did that now. Check the new function ReplicationSlotDropConflicting().
Also the below code is something that I added :
/*
* Note: Even if vxid.localTransactionId is invalid, we need to cancel
* that backend, because there is no other way to make it release the
* slot. So don't bother to validate vxid.localTransactionId.
*/
if (vxid.backendId == InvalidBackendId)
continue;
This was done so that we could kill the walsender in case pg_recvlogical
made it acquire the slot that we want to drop. A walsender does not seem
to have a local transaction id, but CancelVirtualTransaction() also works
if vxid.localTransactionId is invalid. I have added comments explaining
this in CancelVirtualTransaction().
Part of this would need to be split
into a procarray.c helper function (mainly all the stuff dealing with
ProcArrayLock).
I didn't have to split it, by the way.
+ elog(LOG, "Dropping conflicting slot %s", s->data.name.data);
This definitely needs to be expanded, and follow the message style
guideline.
From the v7 patch onwards, the message looks like :
ereport(LOG,
(errmsg("Dropping conflicting slot %s", NameStr(slotname)),
errdetail("%s", conflict_reason)));
Does that suffice ?
+ LWLockRelease(ReplicationSlotControlLock); /* avoid deadlock */
Instead of saying "deadlock" I'd just say that ReplicationSlotDropPtr()
will acquire that lock.
Done
+ ReplicationSlotDropPtr(s);
But more importantly, I don't think this is
correct. ReplicationSlotDropPtr() assumes that the to-be-dropped slot is
acquired by the current backend - without that somebody else could
concurrently acquire that slot.

So I think you need to do something like ReplicationSlotsDropDBSlots()
does:

/* acquire slot, so ReplicationSlotDropAcquired can be reused */
SpinLockAcquire(&s->mutex);
/* can't change while ReplicationSlotControlLock is held */
slotname = NameStr(s->data.name);
active_pid = s->active_pid;
if (active_pid == 0)
{
MyReplicationSlot = s;
s->active_pid = MyProcPid;
}
SpinLockRelease(&s->mutex);
I have now done this in ReplicationSlotDropConflicting() itself.
Greetings,
Andres Freund
I have also removed the code inside #ifdef NOT_ANYMORE that errors out
with "logical decoding cannot be used while in recovery".
I have introduced a new procsignal reason
PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT so that when the conflicting
logical slot is dropped, a new error detail will be shown : "User was
using the logical slot that must be dropped".
Accordingly, added PgStat_StatDBEntry.n_conflict_logicalslot field.
Also, in RecoveryConflictInterrupt(), had to do some special handling
for am_cascading_walsender, so that a conflicting walsender on standby
will be terminated irrespective of the transaction status.
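
Condensed from RecoveryConflictInterrupt() in the attached patch, that
special case looks like this (it sits in the switch branch shared with
the other recovery conflict reasons):

    if (am_cascading_walsender &&
        reason == PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT &&
        MyReplicationSlot && SlotIsLogical(MyReplicationSlot))
    {
        /* signal the walsender even though it is not in a transaction */
        RecoveryConflictPending = true;
        QueryCancelPending = true;
        InterruptPending = true;
        break;
    }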
Attached v9 patch.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
logical-decoding-on-standby_v9.patch (application/octet-stream)
From 24ab7a9da9976cc67fe9b1a374efcf10257eac4a Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Mon, 24 Jun 2019 23:42:42 +0530
Subject: [PATCH] Logical decoding on standby - v9.
Author : Andres Freund.
Besides the above main changes, the patch includes the following :
1. Handle slot conflict recovery by dropping the conflicting slots.
-Amit Khandekar.
2. test/recovery/t/018_logical_decoding_on_replica.pl added.
Original author : Craig Ringer. Few changes/additions from Amit Khandekar.
3. Handle slot conflicts when master wal_level becomes less than logical.
Changes in v6 patch :
While creating the slot, lastReplayedEndRecPtr is used to set the
restart_lsn, but its position is later adjusted in
DecodingContextFindStartpoint() in case it does not point to a
valid record location. This can happen because the replay pointer
points to 1 + the end of the last record replayed, which means it can
coincide with the first byte of a new WAL block, i.e. inside the block
header.
Also, modified the test to handle the requirement that the
logical slot creation on standby requires a checkpoint
(or any other transaction commit) to be given from master. For
that, in src/test/perl/PostgresNode.pm, added a new function
create_logical_slot_on_standby(), which does the required steps.
Changes in v7 patch :
Merge the two conflict messages for xmin and catalog_xmin into
a single one.
Changes in v8 :
Fix incorrect flush ptr on standby (reported by Tushar Ahuja).
In XLogSendLogical(), GetFlushRecPtr() was used to get the flushed
point. On standby, GetFlushRecPtr() does not give a valid value, so it
was wrongly determined that the sent record is beyond flush point, as
a result of which, WalSndCaughtUp was set to true, causing
WalSndLoop() to sleep for some duration after every record.
As reported by Tushar Ahuja, pg_recvlogical appears to hang when there
are lots of inserts.
Fix: Use GetStandbyFlushRecPtr() if am_cascading_walsender
Changes in v9 :
While dropping a conflicting logical slot, if a backend has acquired it, send
it a conflict recovery signal. Check new function ReplicationSlotDropConflicting().
Also, miscellaneous review comments addressed, but not all of them yet.
---
src/backend/access/gist/gistxlog.c | 6 +-
src/backend/access/hash/hash_xlog.c | 3 +-
src/backend/access/hash/hashinsert.c | 2 +
src/backend/access/heap/heapam.c | 23 +-
src/backend/access/heap/vacuumlazy.c | 2 +-
src/backend/access/heap/visibilitymap.c | 2 +-
src/backend/access/nbtree/nbtpage.c | 4 +
src/backend/access/nbtree/nbtxlog.c | 4 +-
src/backend/access/spgist/spgvacuum.c | 2 +
src/backend/access/spgist/spgxlog.c | 1 +
src/backend/access/transam/xlog.c | 22 ++
src/backend/access/transam/xlogreader.c | 4 -
src/backend/postmaster/pgstat.c | 4 +
src/backend/replication/logical/decode.c | 14 +-
src/backend/replication/logical/logical.c | 42 +++
src/backend/replication/slot.c | 212 ++++++++++-
src/backend/replication/walsender.c | 8 +-
src/backend/storage/ipc/procarray.c | 4 +
src/backend/storage/ipc/procsignal.c | 3 +
src/backend/storage/ipc/standby.c | 7 +-
src/backend/tcop/postgres.c | 23 +-
src/backend/utils/adt/pgstatfuncs.c | 1 +
src/backend/utils/cache/lsyscache.c | 16 +
src/include/access/gistxlog.h | 3 +-
src/include/access/hash_xlog.h | 1 +
src/include/access/heapam_xlog.h | 8 +-
src/include/access/nbtxlog.h | 2 +
src/include/access/spgxlog.h | 1 +
src/include/access/xlog.h | 1 +
src/include/access/xlogreader.h | 2 -
src/include/pgstat.h | 1 +
src/include/replication/slot.h | 2 +
src/include/storage/procsignal.h | 1 +
src/include/storage/standby.h | 2 +-
src/include/utils/lsyscache.h | 1 +
src/include/utils/rel.h | 1 +
src/test/perl/PostgresNode.pm | 27 ++
.../recovery/t/018_logical_decoding_on_replica.pl | 395 +++++++++++++++++++++
38 files changed, 809 insertions(+), 48 deletions(-)
create mode 100644 src/test/recovery/t/018_logical_decoding_on_replica.pl
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 503db34..385ea1f 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -195,7 +195,8 @@ gistRedoDeleteRecord(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
@@ -397,7 +398,7 @@ gistRedoPageReuse(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
- xlrec->node);
+ xlrec->onCatalogTable, xlrec->node);
}
}
@@ -589,6 +590,7 @@ gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXi
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index d7b7098..00c3e0f 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -1002,7 +1002,8 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
RelFileNode rnode;
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer);
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 5321762..e28465a 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
#include "access/hash.h"
#include "access/hash_xlog.h"
+#include "catalog/catalog.h"
#include "miscadmin.h"
#include "utils/rel.h"
#include "storage/lwlock.h"
@@ -398,6 +399,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
xl_hash_vacuum_one_page xlrec;
XLogRecPtr recptr;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(hrel);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.ntuples = ndeletable;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d768b9b..10b7857 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7149,12 +7149,13 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
* see comments for vacuum_log_cleanup_info().
*/
XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
+log_heap_cleanup_info(Relation rel, TransactionId latestRemovedXid)
{
xl_heap_cleanup_info xlrec;
XLogRecPtr recptr;
- xlrec.node = rnode;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
+ xlrec.node = rel->rd_node;
xlrec.latestRemovedXid = latestRemovedXid;
XLogBeginInsert();
@@ -7190,6 +7191,7 @@ log_heap_clean(Relation reln, Buffer buffer,
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
xlrec.ndead = ndead;
@@ -7240,6 +7242,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.cutoff_xid = cutoff_xid;
xlrec.ntuples = ntuples;
@@ -7270,7 +7273,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
* heap_buffer, if necessary.
*/
XLogRecPtr
-log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
+log_heap_visible(Relation rel, Buffer heap_buffer, Buffer vm_buffer,
TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
@@ -7280,6 +7283,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(heap_buffer));
Assert(BufferIsValid(vm_buffer));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec.cutoff_xid = cutoff_xid;
xlrec.flags = vmflags;
XLogBeginInsert();
@@ -7700,7 +7704,8 @@ heap_xlog_cleanup_info(XLogReaderState *record)
xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, xlrec->node);
/*
* Actual operation is a no-op. Record type exists to provide a means for
@@ -7736,7 +7741,8 @@ heap_xlog_clean(XLogReaderState *record)
* latestRemovedXid is invalid, skip conflict processing.
*/
if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
/*
* If we have a full-page image, restore it (using a cleanup lock) and
@@ -7832,7 +7838,9 @@ heap_xlog_visible(XLogReaderState *record)
* rather than killing the transaction outright.
*/
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid,
+ xlrec->onCatalogTable,
+ rnode);
/*
* Read the heap page, if it still exists. If the heap file has dropped or
@@ -7969,7 +7977,8 @@ heap_xlog_freeze_page(XLogReaderState *record)
TransactionIdRetreat(latestRemovedXid);
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a3c4a1d..bf34d3a 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -473,7 +473,7 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
* No need to write the record at all unless it contains a valid value
*/
if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
- (void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+ (void) log_heap_cleanup_info(rel, vacrelstats->latestRemovedXid);
}
/*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06..c5fdd64 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -281,7 +281,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
if (XLogRecPtrIsInvalid(recptr))
{
Assert(!InRecovery);
- recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
+ recptr = log_heap_visible(rel, heapBuf, vmBuf,
cutoff_xid, flags);
/*
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 0357030..6b641c9 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -31,6 +31,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/lsyscache.h"
#include "utils/snapmgr.h"
static void _bt_cachemetadata(Relation rel, BTMetaPageData *input);
@@ -773,6 +774,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
@@ -1140,6 +1142,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
XLogRecPtr recptr;
xl_btree_delete xlrec_delete;
+ xlrec_delete.onCatalogTable =
+ RelationIsAccessibleInLogicalDecoding(heapRel);
xlrec_delete.latestRemovedXid = latestRemovedXid;
xlrec_delete.nitems = nitems;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 6532a25..b874bda 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -526,7 +526,8 @@ btree_xlog_delete(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
/*
@@ -810,6 +811,7 @@ btree_xlog_reuse_page(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable,
xlrec->node);
}
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 2b1662a..eaaf631 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -27,6 +27,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
+#include "utils/lsyscache.h"
/* Entry in pending-list of TIDs we need to revisit */
@@ -502,6 +503,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
OffsetNumber itemnos[MaxIndexTuplesPerPage];
spgxlogVacuumRedirect xlrec;
+ xlrec.onCatalogTable = get_rel_logical_catalog(index->rd_index->indrelid);
xlrec.nToPlaceholder = 0;
xlrec.newestRedirectXid = InvalidTransactionId;
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index ebe6ae8..800609c 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -881,6 +881,7 @@ spgRedoVacuumRedirect(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &node, NULL, NULL);
ResolveRecoveryConflictWithSnapshot(xldata->newestRedirectXid,
+ xldata->onCatalogTable,
node);
}
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e08320e..2fe1de2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4926,6 +4926,15 @@ LocalProcessControlFile(bool reset)
}
/*
+ * Get the wal_level from the control file.
+ */
+WalLevel
+GetActiveWalLevel(void)
+{
+ return ControlFile->wal_level;
+}
+
+/*
* Initialization of shared memory for XLOG
*/
Size
@@ -9843,6 +9852,19 @@ xlog_redo(XLogReaderState *record)
/* Update our copy of the parameters in pg_control */
memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
+ /*
+ * Drop logical slots if we are in hot standby and master does not have
+ * logical data. Don't bother to search for the slots if standby is
+ * running with wal_level lower than logical, because in that case,
+ * we would have either disallowed creation of logical slots or dropped
+ * existing ones.
+ */
+ if (InRecovery && InHotStandby &&
+ xlrec.wal_level < WAL_LEVEL_LOGICAL &&
+ wal_level >= WAL_LEVEL_LOGICAL)
+ ResolveRecoveryConflictWithLogicalSlots(InvalidOid, InvalidTransactionId,
+ gettext_noop("logical decoding on standby requires wal_level >= logical on master"));
+
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->MaxConnections = xlrec.MaxConnections;
ControlFile->max_worker_processes = xlrec.max_worker_processes;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 88be7fe..431a302 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -878,7 +878,6 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
return true;
}
-#ifdef FRONTEND
/*
* Functions that are currently not needed in the backend, but are better
* implemented inside xlogreader.c because of the internal facilities available
@@ -1003,9 +1002,6 @@ out:
return found;
}
-#endif /* FRONTEND */
-
-
/* ----------------------------------------
* Functions for decoding the data and block references in a record.
* ----------------------------------------
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b4f2b28..797ea0c 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4728,6 +4728,7 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
dbentry->n_conflict_tablespace = 0;
dbentry->n_conflict_lock = 0;
dbentry->n_conflict_snapshot = 0;
+ dbentry->n_conflict_logicalslot = 0;
dbentry->n_conflict_bufferpin = 0;
dbentry->n_conflict_startup_deadlock = 0;
dbentry->n_temp_files = 0;
@@ -6352,6 +6353,9 @@ pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
dbentry->n_conflict_snapshot++;
break;
+ case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
+ dbentry->n_conflict_logicalslot++;
+ break;
case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
dbentry->n_conflict_bufferpin++;
break;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 151c3ef..c1bd028 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
* can restart from there.
*/
break;
+ case XLOG_PARAMETER_CHANGE:
+ {
+ xl_parameter_change *xlrec =
+ (xl_parameter_change *) XLogRecGetData(buf->record);
+
+ /* Cannot proceed if master itself does not have logical data */
+ if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ break;
+ }
case XLOG_NOOP:
case XLOG_NEXTOID:
case XLOG_SWITCH:
case XLOG_BACKUP_END:
- case XLOG_PARAMETER_CHANGE:
case XLOG_RESTORE_POINT:
case XLOG_FPW_CHANGE:
case XLOG_FPI_FOR_HINT:
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index bbd38c0..347eba7 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -94,6 +94,24 @@ CheckLogicalDecodingRequirements(void)
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires a database connection")));
+ if (RecoveryInProgress())
+ {
+ /*
+ * This check may have race conditions, but whenever
+ * XLOG_PARAMETER_CHANGE indicates that wal_level has changed, we
+ * verify that there are no existing logical replication slots. And to
+ * avoid races around creating a new slot,
+ * CheckLogicalDecodingRequirements() is called once before creating
+ * the slot, and once when logical decoding is initially starting up.
+ */
+ if (GetActiveWalLevel() < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ }
+
+#ifdef NOT_ANYMORE
/* ----
* TODO: We got to change that someday soon...
*
@@ -111,6 +129,7 @@ CheckLogicalDecodingRequirements(void)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("logical decoding cannot be used while in recovery")));
+#endif
}
/*
@@ -241,6 +260,8 @@ CreateInitDecodingContext(char *plugin,
LogicalDecodingContext *ctx;
MemoryContext old_context;
+ CheckLogicalDecodingRequirements();
+
/* shorter lines... */
slot = MyReplicationSlot;
@@ -474,6 +495,27 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
(uint32) (slot->data.restart_lsn >> 32),
(uint32) slot->data.restart_lsn);
+ Assert(!XLogRecPtrIsInvalid(startptr));
+
+ /*
+ * restart_lsn initially may point one past the end of the record. If that
+ * is a XLOG page boundary, it will not be a valid LSN for the start of a
+ * record. If that's the case, look for the start of the first record.
+ */
+ if (!XRecOffIsValid(startptr))
+ {
+ elog(DEBUG1, "invalid restart lsn %X/%X",
+ (uint32) (startptr >> 32), (uint32) startptr);
+ startptr = XLogFindNextRecord(ctx->reader, startptr);
+
+ SpinLockAcquire(&slot->mutex);
+ slot->data.restart_lsn = startptr;
+ SpinLockRelease(&slot->mutex);
+
+ elog(DEBUG1, "moved slot restart lsn to %X/%X",
+ (uint32) (startptr >> 32), (uint32) startptr);
+ }
+
/* Wait for a consistent starting point */
for (;;)
{
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 55c306e..6312a3a 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -46,6 +46,7 @@
#include "pgstat.h"
#include "replication/slot.h"
#include "storage/fd.h"
+#include "storage/lock.h"
#include "storage/proc.h"
#include "storage/procarray.h"
#include "utils/builtins.h"
@@ -101,6 +102,7 @@ int max_replication_slots = 0; /* the maximum number of replication
static void ReplicationSlotDropAcquired(void);
static void ReplicationSlotDropPtr(ReplicationSlot *slot);
+static void ReplicationSlotDropConflicting(ReplicationSlot *slot);
/* internal persistency functions */
static void RestoreSlotFromDisk(const char *name);
@@ -638,6 +640,64 @@ ReplicationSlotDropPtr(ReplicationSlot *slot)
}
/*
+ * Permanently drop a conflicting replication slot. If it's already active by
+ * another backend, send it a recovery conflict signal, and then try again.
+ */
+static void
+ReplicationSlotDropConflicting(ReplicationSlot *slot)
+{
+ pid_t active_pid;
+ PGPROC *proc;
+ VirtualTransactionId vxid;
+
+ ConditionVariablePrepareToSleep(&slot->active_cv);
+ while (1)
+ {
+ SpinLockAcquire(&slot->mutex);
+ active_pid = slot->active_pid;
+ if (active_pid == 0)
+ active_pid = slot->active_pid = MyProcPid;
+ SpinLockRelease(&slot->mutex);
+
+ /* Drop the acquired slot, unless it is acquired by another backend */
+ if (active_pid == MyProcPid)
+ {
+ elog(DEBUG1, "acquired conflicting slot, now dropping it");
+ ReplicationSlotDropPtr(slot);
+ break;
+ }
+
+ /* Send the other backend a recovery conflict signal */
+
+ SetInvalidVirtualTransactionId(vxid);
+ LWLockAcquire(ProcArrayLock, LW_SHARED);
+ proc = BackendPidGetProcWithLock(active_pid);
+ if (proc)
+ GET_VXID_FROM_PGPROC(vxid, *proc);
+ LWLockRelease(ProcArrayLock);
+
+ /*
+ * If coincidentally that process finished, some other backend may
+ * acquire the slot again. So start over again.
+ * Note: Even if vxid.localTransactionId is invalid, we need to cancel
+ * that backend, because there is no other way to make it release the
+ * slot. So don't bother to validate vxid.localTransactionId.
+ */
+ if (vxid.backendId == InvalidBackendId)
+ continue;
+
+ elog(DEBUG1, "cancelling pid %d (backendId: %d) for releasing slot",
+ active_pid, vxid.backendId);
+
+ CancelVirtualTransaction(vxid, PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT);
+ ConditionVariableSleep(&slot->active_cv,
+ WAIT_EVENT_REPLICATION_SLOT_DROP);
+ }
+
+ ConditionVariableCancelSleep();
+}
+
+/*
* Serialize the currently acquired slot's state from memory to disk, thereby
* guaranteeing the current state will survive a crash.
*/
@@ -1016,37 +1076,38 @@ ReplicationSlotReserveWal(void)
/*
* For logical slots log a standby snapshot and start logical decoding
* at exactly that position. That allows the slot to start up more
- * quickly.
+ * quickly. But on a standby we cannot do WAL writes, so just use the
+ * replay pointer; effectively, an attempt to create a logical slot on
+ * standby will cause it to wait for an xl_running_xact record to be
+ * logged independently on the primary, so that a snapshot can be built
+ * using the record.
*
- * That's not needed (or indeed helpful) for physical slots as they'll
- * start replay at the last logged checkpoint anyway. Instead return
- * the location of the last redo LSN. While that slightly increases
- * the chance that we have to retry, it's where a base backup has to
- * start replay at.
+ * None of this is needed (or indeed helpful) for physical slots as
+ * they'll start replay at the last logged checkpoint anyway. Instead
+ * return the location of the last redo LSN. While that slightly
+ * increases the chance that we have to retry, it's where a base backup
+ * has to start replay at.
*/
+
+ restart_lsn =
+ (SlotIsPhysical(slot) ? GetRedoRecPtr() :
+ (RecoveryInProgress() ? GetXLogReplayRecPtr(NULL) :
+ GetXLogInsertRecPtr()));
+
+ SpinLockAcquire(&slot->mutex);
+ slot->data.restart_lsn = restart_lsn;
+ SpinLockRelease(&slot->mutex);
+
if (!RecoveryInProgress() && SlotIsLogical(slot))
{
XLogRecPtr flushptr;
- /* start at current insert position */
- restart_lsn = GetXLogInsertRecPtr();
- SpinLockAcquire(&slot->mutex);
- slot->data.restart_lsn = restart_lsn;
- SpinLockRelease(&slot->mutex);
-
/* make sure we have enough information to start */
flushptr = LogStandbySnapshot();
/* and make sure it's fsynced to disk */
XLogFlush(flushptr);
}
- else
- {
- restart_lsn = GetRedoRecPtr();
- SpinLockAcquire(&slot->mutex);
- slot->data.restart_lsn = restart_lsn;
- SpinLockRelease(&slot->mutex);
- }
/* prevent WAL removal as fast as possible */
ReplicationSlotsComputeRequiredLSN();
@@ -1065,6 +1126,119 @@ ReplicationSlotReserveWal(void)
}
/*
+ * Resolve recovery conflicts with logical slots.
+ *
+ * When xid is valid, it means that rows older than xid might have been
+ * removed. Therefore we need to drop slots that depend on seeing those rows.
+ * When xid is invalid, drop all logical slots. This is required when the
+ * master wal_level is set back to replica, so existing logical slots need to
+ * be dropped. Also, when xid is invalid, a common 'conflict_reason' is
+ * provided for the error detail; otherwise it is NULL, in which case it is
+ * constructed out of the xid value.
+ */
+void
+ResolveRecoveryConflictWithLogicalSlots(Oid dboid, TransactionId xid,
+ char *conflict_reason)
+{
+ int i;
+ bool found_conflict = false;
+
+ if (max_replication_slots <= 0)
+ return;
+
+restart:
+ if (found_conflict)
+ {
+ CHECK_FOR_INTERRUPTS();
+ found_conflict = false;
+ }
+
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationSlot *s;
+
+ s = &ReplicationSlotCtl->replication_slots[i];
+
+ /* cannot change while ReplicationSlotCtlLock is held */
+ if (!s->in_use)
+ continue;
+
+ /* We are only dealing with *logical* slot conflicts. */
+ if (!SlotIsLogical(s))
+ continue;
+
+ /* Invalid xid means caller is asking to drop all logical slots */
+ if (!TransactionIdIsValid(xid))
+ found_conflict = true;
+ else
+ {
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;
+ StringInfoData conflict_str;
+
+ /* not our database, skip */
+ if (s->data.database != InvalidOid && s->data.database != dboid)
+ continue;
+
+ SpinLockAcquire(&s->mutex);
+ slot_xmin = s->data.xmin;
+ slot_catalog_xmin = s->data.catalog_xmin;
+ SpinLockRelease(&s->mutex);
+
+ /*
+ * Build the conflict_str which will look like :
+ * "slot xmin: 1234, slot catalog_xmin: 5678, conflicts with xid
+ * horizon being increased to 9012"
+ */
+ initStringInfo(&conflict_str);
+ if (TransactionIdIsValid(slot_xmin) &&
+ TransactionIdPrecedesOrEquals(slot_xmin, xid))
+ appendStringInfo(&conflict_str, "slot xmin: %d", slot_xmin);
+
+ if (TransactionIdIsValid(slot_catalog_xmin) &&
+ TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
+ appendStringInfo(&conflict_str, "%sslot catalog_xmin: %d",
+ conflict_str.len > 0 ? ", " : "",
+ slot_catalog_xmin);
+
+ if (conflict_str.len > 0)
+ {
+ appendStringInfo(&conflict_str, ", %s %d",
+ gettext_noop("conflicts with xid horizon being increased to"),
+ xid);
+ found_conflict = true;
+ conflict_reason = conflict_str.data;
+ }
+ }
+
+ if (found_conflict)
+ {
+ NameData slotname;
+
+ SpinLockAcquire(&s->mutex);
+ slotname = s->data.name;
+ SpinLockRelease(&s->mutex);
+
+ ereport(LOG,
+ (errmsg("Dropping conflicting slot %s", NameStr(slotname)),
+ errdetail("%s", conflict_reason)));
+
+ /* ReplicationSlotDropPtr() would acquire the lock below */
+ LWLockRelease(ReplicationSlotControlLock);
+
+ ReplicationSlotDropConflicting(s);
+
+ /* We released the lock above; so re-scan the slots. */
+ goto restart;
+ }
+ }
+
+ LWLockRelease(ReplicationSlotControlLock);
+}
+
+
+/*
* Flush all replication slots to disk.
*
* This needn't actually be part of a checkpoint, but it's a convenient
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 92fa86f..4ce7096 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2814,6 +2814,7 @@ XLogSendLogical(void)
{
XLogRecord *record;
char *errm;
+ XLogRecPtr flushPtr;
/*
* Don't know whether we've caught up yet. We'll set WalSndCaughtUp to
@@ -2830,10 +2831,11 @@ XLogSendLogical(void)
if (errm != NULL)
elog(ERROR, "%s", errm);
+ flushPtr = (am_cascading_walsender ?
+ GetStandbyFlushRecPtr() : GetFlushRecPtr());
+
if (record != NULL)
{
- /* XXX: Note that logical decoding cannot be used while in recovery */
- XLogRecPtr flushPtr = GetFlushRecPtr();
/*
* Note the lack of any call to LagTrackerWrite() which is handled by
@@ -2857,7 +2859,7 @@ XLogSendLogical(void)
* If the record we just wanted read is at or beyond the flushed
* point, then we're caught up.
*/
- if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
+ if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
{
WalSndCaughtUp = true;
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 18a0f62..ec696f4 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2669,6 +2669,10 @@ CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
GET_VXID_FROM_PGPROC(procvxid, *proc);
+ /*
+ * Note: vxid.localTransactionId can be invalid, which means the
+ * request is to signal the pid that is not running a transaction.
+ */
if (procvxid.backendId == vxid.backendId &&
procvxid.localTransactionId == vxid.localTransactionId)
{
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 7605b2c..645f320 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -286,6 +286,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT))
RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+ if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT))
+ RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT);
+
if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK))
RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 25b7e31..7cfb6d5 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -23,6 +23,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/slot.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -291,7 +292,8 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
}
void
-ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode node)
+ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
+ bool onCatalogTable, RelFileNode node)
{
VirtualTransactionId *backends;
@@ -312,6 +314,9 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode
ResolveRecoveryConflictWithVirtualXIDs(backends,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+
+ if (onCatalogTable)
+ ResolveRecoveryConflictWithLogicalSlots(node.dbNode, latestRemovedXid, NULL);
}
void
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 44a59e1..c23d361 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2393,6 +2393,9 @@ errdetail_recovery_conflict(void)
case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
errdetail("User query might have needed to see row versions that must be removed.");
break;
+ case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
+ errdetail("User was using the logical slot that must be dropped.");
+ break;
case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
errdetail("User transaction caused buffer deadlock with recovery.");
break;
@@ -2879,6 +2882,25 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
case PROCSIG_RECOVERY_CONFLICT_LOCK:
case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+ case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
+ /*
+ * For conflicts that require a logical slot to be dropped, the
+ * requirement is for the signal receiver to release the slot,
+ * so that it could be dropped by the signal sender. So for
+ * normal backends, the transaction should be aborted, just
+ * like for other recovery conflicts. But if it's walsender on
+ * standby, then it has to be killed so as to release an
+ * acquired logical slot.
+ */
+ if (am_cascading_walsender &&
+ reason == PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT &&
+ MyReplicationSlot && SlotIsLogical(MyReplicationSlot))
+ {
+ RecoveryConflictPending = true;
+ QueryCancelPending = true;
+ InterruptPending = true;
+ break;
+ }
/*
* If we aren't in a transaction any longer then ignore.
@@ -2920,7 +2942,6 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
/* Intentional fall through to session cancel */
/* FALLTHROUGH */
-
case PROCSIG_RECOVERY_CONFLICT_DATABASE:
RecoveryConflictPending = true;
ProcDiePending = true;
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 05240bf..7dfbef7 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1499,6 +1499,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
dbentry->n_conflict_tablespace +
dbentry->n_conflict_lock +
dbentry->n_conflict_snapshot +
+ dbentry->n_conflict_logicalslot +
dbentry->n_conflict_bufferpin +
dbentry->n_conflict_startup_deadlock);
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index c13c08a..bd35bc1 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -18,7 +18,9 @@
#include "access/hash.h"
#include "access/htup_details.h"
#include "access/nbtree.h"
+#include "access/table.h"
#include "bootstrap/bootstrap.h"
+#include "catalog/catalog.h"
#include "catalog/namespace.h"
#include "catalog/pg_am.h"
#include "catalog/pg_amop.h"
@@ -1893,6 +1895,20 @@ get_rel_persistence(Oid relid)
return result;
}
+bool
+get_rel_logical_catalog(Oid relid)
+{
+ bool res;
+ Relation rel;
+
+ /* assume previously locked */
+ rel = heap_open(relid, NoLock);
+ res = RelationIsAccessibleInLogicalDecoding(rel);
+ heap_close(rel, NoLock);
+
+ return res;
+}
+
/* ---------- TRANSFORM CACHE ---------- */
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 969a537..59246c3 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -48,9 +48,9 @@ typedef struct gistxlogPageUpdate
*/
typedef struct gistxlogDelete
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 ntodelete; /* number of deleted offsets */
-
/*
* In payload of blk 0 : todelete OffsetNumbers
*/
@@ -96,6 +96,7 @@ typedef struct gistxlogPageDelete
*/
typedef struct gistxlogPageReuse
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 53b682c..fd70b55 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -263,6 +263,7 @@ typedef struct xl_hash_init_bitmap_page
*/
typedef struct xl_hash_vacuum_one_page
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int ntuples;
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index f6cdca8..a1d1f11 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -237,6 +237,7 @@ typedef struct xl_heap_update
*/
typedef struct xl_heap_clean
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 nredirected;
uint16 ndead;
@@ -252,6 +253,7 @@ typedef struct xl_heap_clean
*/
typedef struct xl_heap_cleanup_info
{
+ bool onCatalogTable;
RelFileNode node;
TransactionId latestRemovedXid;
} xl_heap_cleanup_info;
@@ -332,6 +334,7 @@ typedef struct xl_heap_freeze_tuple
*/
typedef struct xl_heap_freeze_page
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint16 ntuples;
} xl_heap_freeze_page;
@@ -346,6 +349,7 @@ typedef struct xl_heap_freeze_page
*/
typedef struct xl_heap_visible
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint8 flags;
} xl_heap_visible;
@@ -395,7 +399,7 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
extern const char *heap2_identify(uint8 info);
extern void heap_xlog_logical_rewrite(XLogReaderState *r);
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
+extern XLogRecPtr log_heap_cleanup_info(Relation rel,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
@@ -414,7 +418,7 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
bool *totally_frozen);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
-extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
+extern XLogRecPtr log_heap_visible(Relation rel, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc8..f64a33c 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -126,6 +126,7 @@ typedef struct xl_btree_split
*/
typedef struct xl_btree_delete
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int nitems;
@@ -139,6 +140,7 @@ typedef struct xl_btree_delete
*/
typedef struct xl_btree_reuse_page
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/spgxlog.h b/src/include/access/spgxlog.h
index 073f740..d3dad69 100644
--- a/src/include/access/spgxlog.h
+++ b/src/include/access/spgxlog.h
@@ -237,6 +237,7 @@ typedef struct spgxlogVacuumRoot
typedef struct spgxlogVacuumRedirect
{
+ bool onCatalogTable;
uint16 nToPlaceholder; /* number of redirects to make placeholders */
OffsetNumber firstPlaceholder; /* first placeholder tuple to remove */
TransactionId newestRedirectXid; /* newest XID of removed redirects */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 237f4e0..e7439c1 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -299,6 +299,7 @@ extern Size XLOGShmemSize(void);
extern void XLOGShmemInit(void);
extern void BootStrapXLOG(void);
extern void LocalProcessControlFile(bool reset);
+extern WalLevel GetActiveWalLevel(void);
extern void StartupXLOG(void);
extern void ShutdownXLOG(int code, Datum arg);
extern void InitXLOGAccess(void);
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 04228e2..a5ffffc 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -215,9 +215,7 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
/* Invalidate read state */
extern void XLogReaderInvalReadState(XLogReaderState *state);
-#ifdef FRONTEND
extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
-#endif /* FRONTEND */
/* Functions for decoding an XLogRecord */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0a3ad3a..4fe8684 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -604,6 +604,7 @@ typedef struct PgStat_StatDBEntry
PgStat_Counter n_conflict_tablespace;
PgStat_Counter n_conflict_lock;
PgStat_Counter n_conflict_snapshot;
+ PgStat_Counter n_conflict_logicalslot;
PgStat_Counter n_conflict_bufferpin;
PgStat_Counter n_conflict_startup_deadlock;
PgStat_Counter n_temp_files;
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 8fbddea..73b954e 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -205,4 +205,6 @@ extern void CheckPointReplicationSlots(void);
extern void CheckSlotRequirements(void);
+extern void ResolveRecoveryConflictWithLogicalSlots(Oid dboid, TransactionId xid, char *reason);
+
#endif /* SLOT_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 05b186a..956d3c2 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -39,6 +39,7 @@ typedef enum
PROCSIG_RECOVERY_CONFLICT_TABLESPACE,
PROCSIG_RECOVERY_CONFLICT_LOCK,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
+ PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT,
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index a3f8f82..6dedebc 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -28,7 +28,7 @@ extern void InitRecoveryTransactionEnvironment(void);
extern void ShutdownRecoveryTransactionEnvironment(void);
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
- RelFileNode node);
+ bool onCatalogTable, RelFileNode node);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index c8df5bf..579d9ff 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -131,6 +131,7 @@ extern char get_rel_relkind(Oid relid);
extern bool get_rel_relispartition(Oid relid);
extern Oid get_rel_tablespace(Oid relid);
extern char get_rel_persistence(Oid relid);
+extern bool get_rel_logical_catalog(Oid relid);
extern Oid get_transform_fromsql(Oid typid, Oid langid, List *trftypes);
extern Oid get_transform_tosql(Oid typid, Oid langid, List *trftypes);
extern bool get_typisdefined(Oid typid);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d7f33ab..8c90fd7 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -16,6 +16,7 @@
#include "access/tupdesc.h"
#include "access/xlog.h"
+#include "catalog/catalog.h"
#include "catalog/pg_class.h"
#include "catalog/pg_index.h"
#include "catalog/pg_publication.h"
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 6019f37..719837d 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -2000,6 +2000,33 @@ sub pg_recvlogical_upto
=pod
+=item $node->create_logical_slot_on_standby(self, master, slot_name, dbname)
+
+Create logical replication slot on given standby
+
+=cut
+
+sub create_logical_slot_on_standby
+{
+ my ($self, $master, $slot_name, $dbname) = @_;
+ my ($stdout, $stderr);
+
+ my $handle;
+
+ $handle = IPC::Run::start(['pg_recvlogical', '-d', $self->connstr($dbname), '-P', 'test_decoding', '-S', $slot_name, '--create-slot'], '>', \$stdout, '2>', \$stderr);
+ sleep(1);
+
+ # Slot creation on standby waits for an xl_running_xacts record. So arrange
+ # for it.
+ $master->safe_psql('postgres', 'CHECKPOINT');
+
+ $handle->finish();
+
+ return 0;
+}
+
+=pod
+
=back
=cut
diff --git a/src/test/recovery/t/018_logical_decoding_on_replica.pl b/src/test/recovery/t/018_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..304f32a
--- /dev/null
+++ b/src/test/recovery/t/018_logical_decoding_on_replica.pl
@@ -0,0 +1,395 @@
+# Demonstrate that logical can follow timeline switches.
+#
+# Test logical decoding on a standby.
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 51;
+use RecursiveCopy;
+use File::Copy;
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--slot=decoding_standby');
+
+# Fetch xmin columns from slot's pg_replication_slots row, after waiting for
+# given boolean condition to be true to ensure we've reached a quiescent state
+sub wait_for_phys_mins
+{
+ my ($node, $slotname, $check_expr) = @_;
+
+ $node->poll_query_until(
+ 'postgres', qq[
+ SELECT $check_expr
+ FROM pg_catalog.pg_replication_slots
+ WHERE slot_name = '$slotname';
+ ]) or die "Timed out waiting for slot xmins to advance";
+
+ my $slotinfo = $node->slot($slotname);
+ return ($slotinfo->{'xmin'}, $slotinfo->{'catalog_xmin'});
+}
+
+sub print_phys_xmin
+{
+ my $slot = $node_master->slot('decoding_standby');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# After slot creation, xmins must be null
+is($xmin, '', "xmin null");
+is($catalog_xmin, '', "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+ $node_master, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_replica->append_conf('postgresql.conf',
+ q[primary_slot_name = 'decoding_standby']);
+
+$node_replica->start;
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# with hot_standby_feedback off, xmin and catalog_xmin must still be null
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "xmin null after replica join");
+is($catalog_xmin, '', "catalog_xmin null after replica join");
+
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. With hot_standby_feedback on, xmin should advance,
+# but catalog_xmin should still remain NULL since there is no logical slot.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NULL");
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+#
+# This must succeed since we know we have a catalog_xmin reservation. We
+# might've already sent hot standby feedback to advance our physical slot's
+# catalog_xmin but not received the corresponding xlog for the catalog xmin
+# advance, in which case we'll create a slot that isn't usable. The calling
+# application can prevent this by creating a temporary slot on the master to
+# lock in its catalog_xmin. For a truly race-free solution we'd need
+# master-to-standby hot_standby_feedback replies.
+#
+# In this case it won't race because there's no concurrent activity on the
+# master.
+#
+is($node_replica->create_logical_slot_on_standby($node_master, 'standby_logical', 'testdb'),
+ 0, 'logical slot creation on standby succeeded')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+sub print_logical_xmin
+{
+ my $slot = $node_replica->slot('standby_logical');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
+$node_master->safe_psql('testdb', 'DROP TABLE test_table');
+$node_master->safe_psql('testdb', 'VACUUM');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+# Should show the inserts even when the table is dropped on master
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($stderr, '', 'stderr is empty');
+is($ret, 0, 'replay from slot succeeded')
+ or BAIL_OUT('cannot continue if slot replay fails');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_1();]);
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+for my $i (0 .. 2000)
+{
+ $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_2();]);
+$node_master->safe_psql('testdb', 'VACUUM');
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+cmp_ok($new_logical_catalog_xmin, "==", $logical_catalog_xmin, "logical slot catalog_xmin hasn't advanced before get_changes");
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+isnt($stdout, '', 'replayed some rows');
+
+($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+is($new_logical_xmin, '', "logical xmin null");
+isnt($new_logical_catalog_xmin, '', "logical slot catalog_xmin not null");
+cmp_ok($new_logical_catalog_xmin, ">", $logical_catalog_xmin, "logical slot catalog_xmin advanced after get_changes");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+isnt($new_physical_xmin, '', "physical xmin not null");
+# hot standby feedback should advance phys catalog_xmin now that the standby's
+# slot doesn't hold it down as far.
+isnt($new_physical_catalog_xmin, '', "physical catalog_xmin not null");
+cmp_ok($new_physical_catalog_xmin, ">", $physical_catalog_xmin, "physical catalog_xmin advanced");
+
+cmp_ok($new_physical_catalog_xmin, "<=", $new_logical_catalog_xmin, 'upstream physical slot catalog_xmin not past downstream catalog_xmin with hs_feedback on');
+
+#########################################################
+# Upstream oldestXid retention
+#########################################################
+
+sub test_oldest_xid_retention()
+{
+ # First burn some xids on the master in another DB, so we push the master's
+ # nextXid ahead.
+ foreach my $i (1 .. 100)
+ {
+ $node_master->safe_psql('postgres', 'SELECT txid_current()');
+ }
+
+ # Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
+ # past our needed xmin. The only way we have visibility into that is to force
+ # a checkpoint.
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
+ foreach my $dbname ('template1', 'postgres', 'testdb', 'template0')
+ {
+ $node_master->safe_psql($dbname, 'VACUUM FREEZE');
+ }
+ sleep(1);
+ $node_master->safe_psql('postgres', 'CHECKPOINT');
+ IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
+ or die "pg_controldata failed with $?";
+ my @checkpoint = split('\n', $stdout);
+ my ($oldestXid, $nextXid) = ('', '');
+ foreach my $line (@checkpoint)
+ {
+ if ($line =~ qr/^Latest checkpoint's NextXID:\s+\d+:(\d+)/)
+ {
+ $nextXid = $1;
+ }
+ if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
+ {
+ $oldestXid = $1;
+ }
+ }
+ die 'no oldestXID found in checkpoint' unless $oldestXid;
+
+ my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+ my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+
+ print "upstream oldestXid $oldestXid, nextXid $nextXid, phys slot catalog_xmin $new_physical_catalog_xmin, downstream catalog_xmin $new_logical_catalog_xmin";
+
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
+
+ return ($oldestXid);
+}
+
+my ($oldestXid) = test_oldest_xid_retention();
+
+cmp_ok($oldestXid, "<=", $new_logical_catalog_xmin, 'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
+
+########################################################################
+# Recovery conflict: conflicting replication slot should get dropped
+########################################################################
+
+# One way to reproduce recovery conflict is to run VACUUM FULL with
+# hot_standby_feedback turned off on slave.
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = off
+]);
+$node_replica->restart;
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. Both should be NULL since hs_feedback is off
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NULL AND catalog_xmin IS NULL");
+$node_master->safe_psql('testdb', 'VACUUM FULL');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+isnt($ret, 0, 'usage of slot failed as expected');
+like($stderr, qr/does not exist/, 'slot not found as expected');
+
+# Re-create the slot now that we know it is dropped
+is($node_replica->create_logical_slot_on_standby($node_master, 'standby_logical', 'testdb'),
+ 0, 'logical slot creation on standby succeeded')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+# Set hot_standby_feedback back on
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. Both should be non-NULL since hs_feedback is on and
+# there is a logical slot present on standby.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NOT NULL");
+
+##################################################
+# Drop slot
+##################################################
+#
+is($node_replica->safe_psql('postgres', 'SHOW hot_standby_feedback'), 'on', 'hs_feedback is on');
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+
+# Make sure slots on replicas are droppable, and properly clear the upstream's xmin
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
+
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. catalog_xmin should become NULL because we dropped
+# the logical slot.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NULL");
+
+##################################################
+# Recovery: drop database drops idle slots
+##################################################
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB on the upstream if they're on the right DB, or not dropped if on
+# another DB.
+
+is($node_replica->create_logical_slot_on_standby($node_master, 'dodropslot', 'testdb'),
+ 0, 'created dodropslot on testdb')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+is($node_replica->create_logical_slot_on_standby($node_master, 'otherslot', 'postgres'),
+ 0, 'created otherslot on postgres')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, 'logical', 'slot dodropslot on standby created');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'slot otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+ 'database dropped on standby');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, '', 'slot on standby dropped');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby not dropped');
+
+
+##################################################
+# Recovery: drop database drops in-use slots
+##################################################
+
+# This time, have the slot in-use on the downstream DB when we drop it.
+print "Testing dropdb when downstream slot is in-use";
+$node_master->psql('postgres', q[CREATE DATABASE testdb2]);
+
+print "creating slot dodropslot2";
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-P', 'test_decoding', '-S', 'dodropslot2', '--create-slot'],
+ 'pg_recvlogical created slot test_decoding');
+is($node_replica->slot('dodropslot2')->{'slot_type'}, 'logical', 'slot dodropslot2 on standby created');
+
+# make sure the slot is in use
+print "starting pg_recvlogical";
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+sleep(1);
+
+is($node_replica->slot('dodropslot2')->{'active'}, 't', 'slot on standby is active')
+ or BAIL_OUT("slot not active on standby, cannot continue. pg_recvlogical exited with '$stdout', '$stderr'");
+
+# Master doesn't know the replica's slot is busy so dropdb should succeed
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb2]);
+ok(1, 'dropdb finished');
+
+while ($node_replica->slot('dodropslot2')->{'active_pid'})
+{
+ sleep(1);
+ print "waiting for walsender to exit";
+}
+
+print "walsender exited, waiting for pg_recvlogical to exit";
+
+# our client should've terminated in response to the walsender error
+eval {
+ $handle->finish;
+};
+$return = $?;
+if ($return) {
+ is($return, 256, "pg_recvlogical terminated by server");
+ like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict');
+ like($stderr, qr/User was connected to a database that must be dropped./, 'recvlogical recovery conflict db');
+}
+
+is($node_replica->slot('dodropslot2')->{'active_pid'}, '', 'walsender backend exited');
+
+# The slot should be dropped by recovery now
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb2')]), 'f',
+ 'database dropped on standby');
+
+is($node_replica->slot('dodropslot2')->{'slot_type'}, '', 'slot on standby dropped');
--
2.1.4
On Mon, 24 Jun 2019 at 23:58, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Thu, 20 Jun 2019 at 00:31, Andres Freund <andres@anarazel.de> wrote:
Or else, do you think we can just increment the record pointer by
doing something like (lastReplayedEndRecPtr % XLOG_BLCKSZ) +
SizeOfXLogShortPHD() ?

I found out that we can't do this, because we don't know whether the
xlog header is SizeOfXLogShortPHD or SizeOfXLogLongPHD. In fact, in
our context, it is SizeOfXLogLongPHD. So we indeed need the
XLogReaderState handle.

Well, we can determine whether a long or a short header is going to be
used, as that's solely dependent on the LSN:

Discussion of this point (plus some more points) is in a separate
reply. You can reply to my comments there:
/messages/by-id/CAJ3gD9f_HjQ6qP=+1jwzwy77fwcbT4-M3UvVsqpAzsY-jqM8nw@mail.gmail.com

As you suggested, I have used XLogSegmentOffset() to know the header
size, and bumped the restart_lsn in ReplicationSlotReserveWal() rather
than DecodingContextFindStartpoint(). Like I mentioned in the above
link, I am not sure why it's not worth doing this like you said.
- * That's not needed (or indeed helpful) for physical slots as they'll
- * start replay at the last logged checkpoint anyway. Instead return
- * the location of the last redo LSN. While that slightly increases
- * the chance that we have to retry, it's where a base backup has to
- * start replay at.
+ * None of this is needed (or indeed helpful) for physical slots as
+ * they'll start replay at the last logged checkpoint anyway. Instead
+ * return the location of the last redo LSN. While that slightly
+ * increases the chance that we have to retry, it's where a base backup
+ * has to start replay at.
 */
+
+ restart_lsn =
+ (SlotIsPhysical(slot) ? GetRedoRecPtr() :
+ (RecoveryInProgress() ? GetXLogReplayRecPtr(NULL) :
+ GetXLogInsertRecPtr()));

Please rewrite this to use normal if blocks.
Ok, done.
I'm also not convinced that
it's useful to have this if block, and then another if block that
basically tests the same conditions again.

Will check and get back on this one.
Those conditions are not exactly the same. restart_lsn is assigned three
different pointers depending upon three different conditions. And
LogStandbySnapshot() is to be done only for a combination of two
specific conditions. So we need to have two different condition
blocks.

Also, it's better if we have the
"assign-slot-restart_lsn-under-spinlock" part in common code, rather than
repeating it in two different blocks.

We can do something like:
if (!RecoveryInProgress() && SlotIsLogical(slot))
{
restart_lsn = GetXLogInsertRecPtr();
/* Assign restart_lsn to slot restart_lsn under Spinlock */
/* Log standby snapshot and fsync to disk */
}
else
{
if (SlotIsPhysical(slot))
restart_lsn = GetRedoRecPtr();
else if (RecoveryInProgress())
restart_lsn = GetXLogReplayRecPtr(NULL);
else
restart_lsn = GetXLogInsertRecPtr();
/* Assign restart_lsn to slot restart_lsn under Spinlock */
}
But I think a better/simpler approach would be to take the
assign-slot-restart_lsn out of the two condition blocks into a
common location, like this:
if (SlotIsPhysical(slot))
restart_lsn = GetRedoRecPtr();
else if (RecoveryInProgress())
restart_lsn = GetXLogReplayRecPtr(NULL);
else
restart_lsn = GetXLogInsertRecPtr();
/* Assign restart_lsn to slot restart_lsn under Spinlock */
if (!RecoveryInProgress() && SlotIsLogical(slot))
{
/* Log standby snapshot and fsync to disk */
}
So in the updated patch (v10), I have done as above.
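To save readers a dive into the diff below, here is a condensed sketch of
what the ReplicationSlotReserveWal() hunk in the attached v10 boils down to
(comments and the invalid-replay-pointer elog() check are trimmed; see the
patch for the exact code):

    if (SlotIsPhysical(slot))
        restart_lsn = GetRedoRecPtr();
    else if (RecoveryInProgress())
    {
        restart_lsn = GetXLogReplayRecPtr(NULL);

        /*
         * The replay pointer can land on an XLOG page boundary; bump it
         * past the (long or short) page header so it points at a valid
         * record start.
         */
        if (!XRecOffIsValid(restart_lsn))
            restart_lsn += (XLogSegmentOffset(restart_lsn, wal_segment_size) == 0)
                ? SizeOfXLogLongPHD : SizeOfXLogShortPHD;
    }
    else
        restart_lsn = GetXLogInsertRecPtr();

    /* common: publish the chosen restart_lsn under the slot's spinlock */
    SpinLockAcquire(&slot->mutex);
    slot->data.restart_lsn = restart_lsn;
    SpinLockRelease(&slot->mutex);

    /* only a primary can write WAL, so only log the snapshot there */
    if (!RecoveryInProgress() && SlotIsLogical(slot))
    {
        XLogRecPtr flushptr = LogStandbySnapshot();

        XLogFlush(flushptr);
    }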
Attachments:
logical-decoding-on-standby_v10.patch
From f432ba4f782e25db93039a87445696886a1fa479 Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Tue, 25 Jun 2019 15:51:32 +0530
Subject: [PATCH] Logical decoding on standby - v10
Author : Andres Freund.
Besides the above main changes, the patch includes the following:
1. Handle slot conflict recovery by dropping the conflicting slots.
-Amit Khandekar.
2. test/recovery/t/016_logical_decoding_on_replica.pl added.
Original author : Craig Ringer. few changes/additions from Amit Khandekar.
3. Handle slot conflicts when master wal_level becomes less than logical.
Changes in v6 patch :
While creating the slot, lastReplayedEndRecPtr is used to set the
restart_lsn, but its position is later adjusted in
DecodingContextFindStartpoint() in case it does not point to a
valid record location. This can happen because replay pointer
points to 1 + end of last record replayed, which means it can
coincide with first byte of a new WAL block, i.e. inside block
header.
Also, modified the test to handle the requirement that the
logical slot creation on standby requires a checkpoint
(or any other transaction commit) to be given from master. For
that, in src/test/perl/PostgresNode.pm, added a new function
create_logical_slot_on_standby() which does the required steps.
Changes in v7 patch :
Merge the two conflict messages for xmin and catalog_xmin into
a single one.
Changes in v8 :
Fix incorrect flush ptr on standby (reported by Tushar Ahuja).
In XLogSendLogical(), GetFlushRecPtr() was used to get the flushed
point. On standby, GetFlushRecPtr() does not give a valid value, so it
was wrongly determined that the sent record is beyond flush point, as
a result of which, WalSndCaughtUp was set to true, causing
WalSndLoop() to sleep for some duration after every record.
This was reported by Tushar Ahuja, where pg_recvlogical seemed to be
hanging when there were lots of inserts.
Fix: Use GetStandbyFlushRecPtr() if am_cascading_walsender
Changes in v9 :
While dropping a conflicting logical slot, if a backend has acquired it, send
it a conflict recovery signal. Check new function ReplicationSlotDropConflicting().
Also, miscellaneous review comments addressed, but not all of them yet.
Changes in v10 :
Adjust restart_lsn if it's a Replay Pointer.
This was earlier done in DecodingContextFindStartpoint() but now it
is done in ReplicationSlotReserveWal(), when restart_lsn is initialized.
---
src/backend/access/gist/gistxlog.c | 6 +-
src/backend/access/hash/hash_xlog.c | 3 +-
src/backend/access/hash/hashinsert.c | 2 +
src/backend/access/heap/heapam.c | 23 +-
src/backend/access/heap/vacuumlazy.c | 2 +-
src/backend/access/heap/visibilitymap.c | 2 +-
src/backend/access/nbtree/nbtpage.c | 4 +
src/backend/access/nbtree/nbtxlog.c | 4 +-
src/backend/access/spgist/spgvacuum.c | 2 +
src/backend/access/spgist/spgxlog.c | 1 +
src/backend/access/transam/xlog.c | 22 ++
src/backend/postmaster/pgstat.c | 4 +
src/backend/replication/logical/decode.c | 14 +-
src/backend/replication/logical/logical.c | 33 +-
src/backend/replication/slot.c | 230 +++++++++++-
src/backend/replication/walsender.c | 8 +-
src/backend/storage/ipc/procarray.c | 4 +
src/backend/storage/ipc/procsignal.c | 3 +
src/backend/storage/ipc/standby.c | 7 +-
src/backend/tcop/postgres.c | 23 +-
src/backend/utils/adt/pgstatfuncs.c | 1 +
src/backend/utils/cache/lsyscache.c | 16 +
src/include/access/gistxlog.h | 3 +-
src/include/access/hash_xlog.h | 1 +
src/include/access/heapam_xlog.h | 8 +-
src/include/access/nbtxlog.h | 2 +
src/include/access/spgxlog.h | 1 +
src/include/access/xlog.h | 1 +
src/include/pgstat.h | 1 +
src/include/replication/slot.h | 2 +
src/include/storage/procsignal.h | 1 +
src/include/storage/standby.h | 2 +-
src/include/utils/lsyscache.h | 1 +
src/include/utils/rel.h | 1 +
src/test/perl/PostgresNode.pm | 27 ++
.../recovery/t/018_logical_decoding_on_replica.pl | 395 +++++++++++++++++++++
36 files changed, 802 insertions(+), 58 deletions(-)
create mode 100644 src/test/recovery/t/018_logical_decoding_on_replica.pl
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 503db34..385ea1f 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -195,7 +195,8 @@ gistRedoDeleteRecord(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
@@ -397,7 +398,7 @@ gistRedoPageReuse(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
- xlrec->node);
+ xlrec->onCatalogTable, xlrec->node);
}
}
@@ -589,6 +590,7 @@ gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXi
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index d7b7098..00c3e0f 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -1002,7 +1002,8 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
RelFileNode rnode;
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer);
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 5321762..e28465a 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
#include "access/hash.h"
#include "access/hash_xlog.h"
+#include "catalog/catalog.h"
#include "miscadmin.h"
#include "utils/rel.h"
#include "storage/lwlock.h"
@@ -398,6 +399,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
xl_hash_vacuum_one_page xlrec;
XLogRecPtr recptr;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(hrel);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.ntuples = ndeletable;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d768b9b..10b7857 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7149,12 +7149,13 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
* see comments for vacuum_log_cleanup_info().
*/
XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
+log_heap_cleanup_info(Relation rel, TransactionId latestRemovedXid)
{
xl_heap_cleanup_info xlrec;
XLogRecPtr recptr;
- xlrec.node = rnode;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
+ xlrec.node = rel->rd_node;
xlrec.latestRemovedXid = latestRemovedXid;
XLogBeginInsert();
@@ -7190,6 +7191,7 @@ log_heap_clean(Relation reln, Buffer buffer,
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
xlrec.ndead = ndead;
@@ -7240,6 +7242,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.cutoff_xid = cutoff_xid;
xlrec.ntuples = ntuples;
@@ -7270,7 +7273,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
* heap_buffer, if necessary.
*/
XLogRecPtr
-log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
+log_heap_visible(Relation rel, Buffer heap_buffer, Buffer vm_buffer,
TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
@@ -7280,6 +7283,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(heap_buffer));
Assert(BufferIsValid(vm_buffer));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec.cutoff_xid = cutoff_xid;
xlrec.flags = vmflags;
XLogBeginInsert();
@@ -7700,7 +7704,8 @@ heap_xlog_cleanup_info(XLogReaderState *record)
xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, xlrec->node);
/*
* Actual operation is a no-op. Record type exists to provide a means for
@@ -7736,7 +7741,8 @@ heap_xlog_clean(XLogReaderState *record)
* latestRemovedXid is invalid, skip conflict processing.
*/
if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
/*
* If we have a full-page image, restore it (using a cleanup lock) and
@@ -7832,7 +7838,9 @@ heap_xlog_visible(XLogReaderState *record)
* rather than killing the transaction outright.
*/
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid,
+ xlrec->onCatalogTable,
+ rnode);
/*
* Read the heap page, if it still exists. If the heap file has dropped or
@@ -7969,7 +7977,8 @@ heap_xlog_freeze_page(XLogReaderState *record)
TransactionIdRetreat(latestRemovedXid);
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a3c4a1d..bf34d3a 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -473,7 +473,7 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
* No need to write the record at all unless it contains a valid value
*/
if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
- (void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+ (void) log_heap_cleanup_info(rel, vacrelstats->latestRemovedXid);
}
/*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06..c5fdd64 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -281,7 +281,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
if (XLogRecPtrIsInvalid(recptr))
{
Assert(!InRecovery);
- recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
+ recptr = log_heap_visible(rel, heapBuf, vmBuf,
cutoff_xid, flags);
/*
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 0357030..6b641c9 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -31,6 +31,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/lsyscache.h"
#include "utils/snapmgr.h"
static void _bt_cachemetadata(Relation rel, BTMetaPageData *input);
@@ -773,6 +774,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
@@ -1140,6 +1142,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
XLogRecPtr recptr;
xl_btree_delete xlrec_delete;
+ xlrec_delete.onCatalogTable =
+ RelationIsAccessibleInLogicalDecoding(heapRel);
xlrec_delete.latestRemovedXid = latestRemovedXid;
xlrec_delete.nitems = nitems;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 6532a25..b874bda 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -526,7 +526,8 @@ btree_xlog_delete(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
/*
@@ -810,6 +811,7 @@ btree_xlog_reuse_page(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable,
xlrec->node);
}
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 2b1662a..eaaf631 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -27,6 +27,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
+#include "utils/lsyscache.h"
/* Entry in pending-list of TIDs we need to revisit */
@@ -502,6 +503,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
OffsetNumber itemnos[MaxIndexTuplesPerPage];
spgxlogVacuumRedirect xlrec;
+ xlrec.onCatalogTable = get_rel_logical_catalog(index->rd_index->indrelid);
xlrec.nToPlaceholder = 0;
xlrec.newestRedirectXid = InvalidTransactionId;
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index ebe6ae8..800609c 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -881,6 +881,7 @@ spgRedoVacuumRedirect(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &node, NULL, NULL);
ResolveRecoveryConflictWithSnapshot(xldata->newestRedirectXid,
+ xldata->onCatalogTable,
node);
}
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e08320e..2fe1de2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4926,6 +4926,15 @@ LocalProcessControlFile(bool reset)
}
/*
+ * Get the wal_level from the control file.
+ */
+WalLevel
+GetActiveWalLevel(void)
+{
+ return ControlFile->wal_level;
+}
+
+/*
* Initialization of shared memory for XLOG
*/
Size
@@ -9843,6 +9852,19 @@ xlog_redo(XLogReaderState *record)
/* Update our copy of the parameters in pg_control */
memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
+ /*
+ * Drop logical slots if we are in hot standby and master does not have
+ * logical data. Don't bother to search for the slots if standby is
+ * running with wal_level lower than logical, because in that case,
+ * we would have either disallowed creation of logical slots or dropped
+ * existing ones.
+ */
+ if (InRecovery && InHotStandby &&
+ xlrec.wal_level < WAL_LEVEL_LOGICAL &&
+ wal_level >= WAL_LEVEL_LOGICAL)
+ ResolveRecoveryConflictWithLogicalSlots(InvalidOid, InvalidTransactionId,
+ gettext_noop("logical decoding on standby requires wal_level >= logical on master"));
+
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->MaxConnections = xlrec.MaxConnections;
ControlFile->max_worker_processes = xlrec.max_worker_processes;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b4f2b28..797ea0c 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4728,6 +4728,7 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
dbentry->n_conflict_tablespace = 0;
dbentry->n_conflict_lock = 0;
dbentry->n_conflict_snapshot = 0;
+ dbentry->n_conflict_logicalslot = 0;
dbentry->n_conflict_bufferpin = 0;
dbentry->n_conflict_startup_deadlock = 0;
dbentry->n_temp_files = 0;
@@ -6352,6 +6353,9 @@ pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
dbentry->n_conflict_snapshot++;
break;
+ case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
+ dbentry->n_conflict_logicalslot++;
+ break;
case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
dbentry->n_conflict_bufferpin++;
break;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 151c3ef..c1bd028 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
* can restart from there.
*/
break;
+ case XLOG_PARAMETER_CHANGE:
+ {
+ xl_parameter_change *xlrec =
+ (xl_parameter_change *) XLogRecGetData(buf->record);
+
+ /* Cannot proceed if master itself does not have logical data */
+ if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ break;
+ }
case XLOG_NOOP:
case XLOG_NEXTOID:
case XLOG_SWITCH:
case XLOG_BACKUP_END:
- case XLOG_PARAMETER_CHANGE:
case XLOG_RESTORE_POINT:
case XLOG_FPW_CHANGE:
case XLOG_FPI_FOR_HINT:
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index bbd38c0..4169828 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -94,23 +94,22 @@ CheckLogicalDecodingRequirements(void)
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires a database connection")));
- /* ----
- * TODO: We got to change that someday soon...
- *
- * There's basically three things missing to allow this:
- * 1) We need to be able to correctly and quickly identify the timeline a
- * LSN belongs to
- * 2) We need to force hot_standby_feedback to be enabled at all times so
- * the primary cannot remove rows we need.
- * 3) support dropping replication slots referring to a database, in
- * dbase_redo. There can't be any active ones due to HS recovery
- * conflicts, so that should be relatively easy.
- * ----
- */
if (RecoveryInProgress())
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("logical decoding cannot be used while in recovery")));
+ {
+ /*
+ * This check may have race conditions, but whenever
+ * XLOG_PARAMETER_CHANGE indicates that wal_level has changed, we
+ * verify that there are no existing logical replication slots. And to
+ * avoid races around creating a new slot,
+ * CheckLogicalDecodingRequirements() is called once before creating
+ * the slot, and once when logical decoding is initially starting up.
+ */
+ if (GetActiveWalLevel() < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ }
}
/*
@@ -241,6 +240,8 @@ CreateInitDecodingContext(char *plugin,
LogicalDecodingContext *ctx;
MemoryContext old_context;
+ CheckLogicalDecodingRequirements();
+
/* shorter lines... */
slot = MyReplicationSlot;
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 55c306e..fcffba2 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -46,6 +46,7 @@
#include "pgstat.h"
#include "replication/slot.h"
#include "storage/fd.h"
+#include "storage/lock.h"
#include "storage/proc.h"
#include "storage/procarray.h"
#include "utils/builtins.h"
@@ -101,6 +102,7 @@ int max_replication_slots = 0; /* the maximum number of replication
static void ReplicationSlotDropAcquired(void);
static void ReplicationSlotDropPtr(ReplicationSlot *slot);
+static void ReplicationSlotDropConflicting(ReplicationSlot *slot);
/* internal persistency functions */
static void RestoreSlotFromDisk(const char *name);
@@ -638,6 +640,64 @@ ReplicationSlotDropPtr(ReplicationSlot *slot)
}
/*
+ * Permanently drop a conflicting replication slot. If it's already active by
+ * another backend, send it a recovery conflict signal, and then try again.
+ */
+static void
+ReplicationSlotDropConflicting(ReplicationSlot *slot)
+{
+ pid_t active_pid;
+ PGPROC *proc;
+ VirtualTransactionId vxid;
+
+ ConditionVariablePrepareToSleep(&slot->active_cv);
+ while (1)
+ {
+ SpinLockAcquire(&slot->mutex);
+ active_pid = slot->active_pid;
+ if (active_pid == 0)
+ active_pid = slot->active_pid = MyProcPid;
+ SpinLockRelease(&slot->mutex);
+
+ /* Drop the acquired slot, unless it is acquired by another backend */
+ if (active_pid == MyProcPid)
+ {
+ elog(DEBUG1, "acquired conflicting slot, now dropping it");
+ ReplicationSlotDropPtr(slot);
+ break;
+ }
+
+ /* Send the other backend a conflict recovery signal */
+
+ SetInvalidVirtualTransactionId(vxid);
+ LWLockAcquire(ProcArrayLock, LW_SHARED);
+ proc = BackendPidGetProcWithLock(active_pid);
+ if (proc)
+ GET_VXID_FROM_PGPROC(vxid, *proc);
+ LWLockRelease(ProcArrayLock);
+
+ /*
+ * If coincidentally that process finished, some other backend may
+ * acquire the slot again. So start over again.
+ * Note: Even if vxid.localTransactionId is invalid, we need to cancel
+ * that backend, because there is no other way to make it release the
+ * slot. So don't bother to validate vxid.localTransactionId.
+ */
+ if (vxid.backendId == InvalidBackendId)
+ continue;
+
+ elog(DEBUG1, "cancelling pid %d (backendId: %d) for releasing slot",
+ active_pid, vxid.backendId);
+
+ CancelVirtualTransaction(vxid, PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT);
+ ConditionVariableSleep(&slot->active_cv,
+ WAIT_EVENT_REPLICATION_SLOT_DROP);
+ }
+
+ ConditionVariableCancelSleep();
+}
+
+/*
* Serialize the currently acquired slot's state from memory to disk, thereby
* guaranteeing the current state will survive a crash.
*/
@@ -1016,37 +1076,56 @@ ReplicationSlotReserveWal(void)
/*
* For logical slots log a standby snapshot and start logical decoding
* at exactly that position. That allows the slot to start up more
- * quickly.
+ * quickly. But on a standby we cannot do WAL writes, so just use the
+ * replay pointer; effectively, an attempt to create a logical slot on
+ * standby will cause it to wait for an xl_running_xact record to be
+ * logged independently on the primary, so that a snapshot can be built
+ * using the record.
*
- * That's not needed (or indeed helpful) for physical slots as they'll
- * start replay at the last logged checkpoint anyway. Instead return
- * the location of the last redo LSN. While that slightly increases
- * the chance that we have to retry, it's where a base backup has to
- * start replay at.
+ * None of this is needed (or indeed helpful) for physical slots as
+ * they'll start replay at the last logged checkpoint anyway. Instead
+ * return the location of the last redo LSN. While that slightly
+ * increases the chance that we have to retry, it's where a base backup
+ * has to start replay at.
*/
+ if (SlotIsPhysical(slot))
+ restart_lsn = GetRedoRecPtr();
+ else if (RecoveryInProgress())
+ {
+ restart_lsn = GetXLogReplayRecPtr(NULL);
+ /*
+ * Replay pointer may point one past the end of the record. If that
+ * is a XLOG page boundary, it will not be a valid LSN for the
+ * start of a record, so bump it up past the page header.
+ */
+ if (!XRecOffIsValid(restart_lsn))
+ {
+ if (restart_lsn % XLOG_BLCKSZ != 0)
+ elog(ERROR, "invalid replay pointer");
+ /* For the first page of a segment file, it's a long header */
+ if (XLogSegmentOffset(restart_lsn, wal_segment_size) == 0)
+ restart_lsn += SizeOfXLogLongPHD;
+ else
+ restart_lsn += SizeOfXLogShortPHD;
+ }
+ }
+ else
+ restart_lsn = GetXLogInsertRecPtr();
+
+ SpinLockAcquire(&slot->mutex);
+ slot->data.restart_lsn = restart_lsn;
+ SpinLockRelease(&slot->mutex);
+
if (!RecoveryInProgress() && SlotIsLogical(slot))
{
XLogRecPtr flushptr;
- /* start at current insert position */
- restart_lsn = GetXLogInsertRecPtr();
- SpinLockAcquire(&slot->mutex);
- slot->data.restart_lsn = restart_lsn;
- SpinLockRelease(&slot->mutex);
-
/* make sure we have enough information to start */
flushptr = LogStandbySnapshot();
/* and make sure it's fsynced to disk */
XLogFlush(flushptr);
}
- else
- {
- restart_lsn = GetRedoRecPtr();
- SpinLockAcquire(&slot->mutex);
- slot->data.restart_lsn = restart_lsn;
- SpinLockRelease(&slot->mutex);
- }
/* prevent WAL removal as fast as possible */
ReplicationSlotsComputeRequiredLSN();
@@ -1065,6 +1144,119 @@ ReplicationSlotReserveWal(void)
}
/*
+ * Resolve recovery conflicts with logical slots.
+ *
+ * When xid is valid, it means that rows older than xid might have been
+ * removed. Therefore we need to drop slots that depend on seeing those rows.
+ * When xid is invalid, drop all logical slots. This is required when the
+ * master wal_level is set back to replica, so existing logical slots need to
+ * be dropped. Also, when xid is invalid, a common 'conflict_reason' is
+ * provided for the error detail; otherwise it is NULL, in which case it is
+ * constructed out of the xid value.
+ */
+void
+ResolveRecoveryConflictWithLogicalSlots(Oid dboid, TransactionId xid,
+ char *conflict_reason)
+{
+ int i;
+ bool found_conflict = false;
+
+ if (max_replication_slots <= 0)
+ return;
+
+restart:
+ if (found_conflict)
+ {
+ CHECK_FOR_INTERRUPTS();
+ found_conflict = false;
+ }
+
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationSlot *s;
+
+ s = &ReplicationSlotCtl->replication_slots[i];
+
+ /* cannot change while ReplicationSlotCtlLock is held */
+ if (!s->in_use)
+ continue;
+
+ /* We are only dealing with *logical* slot conflicts. */
+ if (!SlotIsLogical(s))
+ continue;
+
+ /* Invalid xid means caller is asking to drop all logical slots */
+ if (!TransactionIdIsValid(xid))
+ found_conflict = true;
+ else
+ {
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;
+ StringInfoData conflict_str;
+
+ /* not our database, skip */
+ if (s->data.database != InvalidOid && s->data.database != dboid)
+ continue;
+
+ SpinLockAcquire(&s->mutex);
+ slot_xmin = s->data.xmin;
+ slot_catalog_xmin = s->data.catalog_xmin;
+ SpinLockRelease(&s->mutex);
+
+ /*
+ * Build the conflict_str which will look like :
+ * "slot xmin: 1234, slot catalog_xmin: 5678, conflicts with xid
+ * horizon being increased to 9012"
+ */
+ initStringInfo(&conflict_str);
+ if (TransactionIdIsValid(slot_xmin) &&
+ TransactionIdPrecedesOrEquals(slot_xmin, xid))
+ appendStringInfo(&conflict_str, "slot xmin: %d", slot_xmin);
+
+ if (TransactionIdIsValid(slot_catalog_xmin) &&
+ TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
+ appendStringInfo(&conflict_str, "%sslot catalog_xmin: %d",
+ conflict_str.len > 0 ? ", " : "",
+ slot_catalog_xmin);
+
+ if (conflict_str.len > 0)
+ {
+ appendStringInfo(&conflict_str, ", %s %d",
+ gettext_noop("conflicts with xid horizon being increased to"),
+ xid);
+ found_conflict = true;
+ conflict_reason = conflict_str.data;
+ }
+ }
+
+ if (found_conflict)
+ {
+ NameData slotname;
+
+ SpinLockAcquire(&s->mutex);
+ slotname = s->data.name;
+ SpinLockRelease(&s->mutex);
+
+ ereport(LOG,
+ (errmsg("Dropping conflicting slot %s", NameStr(slotname)),
+ errdetail("%s", conflict_reason)));
+
+ /* ReplicationSlotDropPtr() would acquire the lock below */
+ LWLockRelease(ReplicationSlotControlLock);
+
+ ReplicationSlotDropConflicting(s);
+
+ /* We released the lock above; so re-scan the slots. */
+ goto restart;
+ }
+ }
+
+ LWLockRelease(ReplicationSlotControlLock);
+}
+
+
+/*
* Flush all replication slots to disk.
*
* This needn't actually be part of a checkpoint, but it's a convenient
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 92fa86f..4ce7096 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2814,6 +2814,7 @@ XLogSendLogical(void)
{
XLogRecord *record;
char *errm;
+ XLogRecPtr flushPtr;
/*
* Don't know whether we've caught up yet. We'll set WalSndCaughtUp to
@@ -2830,10 +2831,11 @@ XLogSendLogical(void)
if (errm != NULL)
elog(ERROR, "%s", errm);
+ flushPtr = (am_cascading_walsender ?
+ GetStandbyFlushRecPtr() : GetFlushRecPtr());
+
if (record != NULL)
{
- /* XXX: Note that logical decoding cannot be used while in recovery */
- XLogRecPtr flushPtr = GetFlushRecPtr();
/*
* Note the lack of any call to LagTrackerWrite() which is handled by
@@ -2857,7 +2859,7 @@ XLogSendLogical(void)
* If the record we just wanted read is at or beyond the flushed
* point, then we're caught up.
*/
- if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
+ if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
{
WalSndCaughtUp = true;
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 18a0f62..ec696f4 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2669,6 +2669,10 @@ CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
GET_VXID_FROM_PGPROC(procvxid, *proc);
+ /*
+ * Note: vxid.localTransactionId can be invalid, which means the
+ * request is to signal the pid that is not running a transaction.
+ */
if (procvxid.backendId == vxid.backendId &&
procvxid.localTransactionId == vxid.localTransactionId)
{
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 7605b2c..645f320 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -286,6 +286,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT))
RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+ if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT))
+ RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT);
+
if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK))
RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 25b7e31..7cfb6d5 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -23,6 +23,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/slot.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -291,7 +292,8 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
}
void
-ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode node)
+ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
+ bool onCatalogTable, RelFileNode node)
{
VirtualTransactionId *backends;
@@ -312,6 +314,9 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode
ResolveRecoveryConflictWithVirtualXIDs(backends,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+
+ if (onCatalogTable)
+ ResolveRecoveryConflictWithLogicalSlots(node.dbNode, latestRemovedXid, NULL);
}
void
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 44a59e1..c23d361 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2393,6 +2393,9 @@ errdetail_recovery_conflict(void)
case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
errdetail("User query might have needed to see row versions that must be removed.");
break;
+ case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
+ errdetail("User was using the logical slot that must be dropped.");
+ break;
case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
errdetail("User transaction caused buffer deadlock with recovery.");
break;
@@ -2879,6 +2882,25 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
case PROCSIG_RECOVERY_CONFLICT_LOCK:
case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+ case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
+ /*
+ * For conflicts that require a logical slot to be dropped, the
+ * requirement is for the signal receiver to release the slot,
+ * so that it could be dropped by the signal sender. So for
+ * normal backends, the transaction should be aborted, just
+ * like for other recovery conflicts. But if it's walsender on
+ * standby, then it has to be killed so as to release an
+ * acquired logical slot.
+ */
+ if (am_cascading_walsender &&
+ reason == PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT &&
+ MyReplicationSlot && SlotIsLogical(MyReplicationSlot))
+ {
+ RecoveryConflictPending = true;
+ QueryCancelPending = true;
+ InterruptPending = true;
+ break;
+ }
/*
* If we aren't in a transaction any longer then ignore.
@@ -2920,7 +2942,6 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
/* Intentional fall through to session cancel */
/* FALLTHROUGH */
-
case PROCSIG_RECOVERY_CONFLICT_DATABASE:
RecoveryConflictPending = true;
ProcDiePending = true;
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 05240bf..7dfbef7 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1499,6 +1499,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
dbentry->n_conflict_tablespace +
dbentry->n_conflict_lock +
dbentry->n_conflict_snapshot +
+ dbentry->n_conflict_logicalslot +
dbentry->n_conflict_bufferpin +
dbentry->n_conflict_startup_deadlock);
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index c13c08a..bd35bc1 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -18,7 +18,9 @@
#include "access/hash.h"
#include "access/htup_details.h"
#include "access/nbtree.h"
+#include "access/table.h"
#include "bootstrap/bootstrap.h"
+#include "catalog/catalog.h"
#include "catalog/namespace.h"
#include "catalog/pg_am.h"
#include "catalog/pg_amop.h"
@@ -1893,6 +1895,20 @@ get_rel_persistence(Oid relid)
return result;
}
+bool
+get_rel_logical_catalog(Oid relid)
+{
+ bool res;
+ Relation rel;
+
+ /* assume previously locked */
+ rel = heap_open(relid, NoLock);
+ res = RelationIsAccessibleInLogicalDecoding(rel);
+ heap_close(rel, NoLock);
+
+ return res;
+}
+
/* ---------- TRANSFORM CACHE ---------- */
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 969a537..59246c3 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -48,9 +48,9 @@ typedef struct gistxlogPageUpdate
*/
typedef struct gistxlogDelete
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 ntodelete; /* number of deleted offsets */
-
/*
* In payload of blk 0 : todelete OffsetNumbers
*/
@@ -96,6 +96,7 @@ typedef struct gistxlogPageDelete
*/
typedef struct gistxlogPageReuse
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 53b682c..fd70b55 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -263,6 +263,7 @@ typedef struct xl_hash_init_bitmap_page
*/
typedef struct xl_hash_vacuum_one_page
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int ntuples;
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index f6cdca8..a1d1f11 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -237,6 +237,7 @@ typedef struct xl_heap_update
*/
typedef struct xl_heap_clean
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 nredirected;
uint16 ndead;
@@ -252,6 +253,7 @@ typedef struct xl_heap_clean
*/
typedef struct xl_heap_cleanup_info
{
+ bool onCatalogTable;
RelFileNode node;
TransactionId latestRemovedXid;
} xl_heap_cleanup_info;
@@ -332,6 +334,7 @@ typedef struct xl_heap_freeze_tuple
*/
typedef struct xl_heap_freeze_page
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint16 ntuples;
} xl_heap_freeze_page;
@@ -346,6 +349,7 @@ typedef struct xl_heap_freeze_page
*/
typedef struct xl_heap_visible
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint8 flags;
} xl_heap_visible;
@@ -395,7 +399,7 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
extern const char *heap2_identify(uint8 info);
extern void heap_xlog_logical_rewrite(XLogReaderState *r);
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
+extern XLogRecPtr log_heap_cleanup_info(Relation rel,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
@@ -414,7 +418,7 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
bool *totally_frozen);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
-extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
+extern XLogRecPtr log_heap_visible(Relation rel, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc8..f64a33c 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -126,6 +126,7 @@ typedef struct xl_btree_split
*/
typedef struct xl_btree_delete
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int nitems;
@@ -139,6 +140,7 @@ typedef struct xl_btree_delete
*/
typedef struct xl_btree_reuse_page
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/spgxlog.h b/src/include/access/spgxlog.h
index 073f740..d3dad69 100644
--- a/src/include/access/spgxlog.h
+++ b/src/include/access/spgxlog.h
@@ -237,6 +237,7 @@ typedef struct spgxlogVacuumRoot
typedef struct spgxlogVacuumRedirect
{
+ bool onCatalogTable;
uint16 nToPlaceholder; /* number of redirects to make placeholders */
OffsetNumber firstPlaceholder; /* first placeholder tuple to remove */
TransactionId newestRedirectXid; /* newest XID of removed redirects */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 237f4e0..e7439c1 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -299,6 +299,7 @@ extern Size XLOGShmemSize(void);
extern void XLOGShmemInit(void);
extern void BootStrapXLOG(void);
extern void LocalProcessControlFile(bool reset);
+extern WalLevel GetActiveWalLevel(void);
extern void StartupXLOG(void);
extern void ShutdownXLOG(int code, Datum arg);
extern void InitXLOGAccess(void);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0a3ad3a..4fe8684 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -604,6 +604,7 @@ typedef struct PgStat_StatDBEntry
PgStat_Counter n_conflict_tablespace;
PgStat_Counter n_conflict_lock;
PgStat_Counter n_conflict_snapshot;
+ PgStat_Counter n_conflict_logicalslot;
PgStat_Counter n_conflict_bufferpin;
PgStat_Counter n_conflict_startup_deadlock;
PgStat_Counter n_temp_files;
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 8fbddea..73b954e 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -205,4 +205,6 @@ extern void CheckPointReplicationSlots(void);
extern void CheckSlotRequirements(void);
+extern void ResolveRecoveryConflictWithLogicalSlots(Oid dboid, TransactionId xid, char *reason);
+
#endif /* SLOT_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 05b186a..956d3c2 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -39,6 +39,7 @@ typedef enum
PROCSIG_RECOVERY_CONFLICT_TABLESPACE,
PROCSIG_RECOVERY_CONFLICT_LOCK,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
+ PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT,
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index a3f8f82..6dedebc 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -28,7 +28,7 @@ extern void InitRecoveryTransactionEnvironment(void);
extern void ShutdownRecoveryTransactionEnvironment(void);
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
- RelFileNode node);
+ bool onCatalogTable, RelFileNode node);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index c8df5bf..579d9ff 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -131,6 +131,7 @@ extern char get_rel_relkind(Oid relid);
extern bool get_rel_relispartition(Oid relid);
extern Oid get_rel_tablespace(Oid relid);
extern char get_rel_persistence(Oid relid);
+extern bool get_rel_logical_catalog(Oid relid);
extern Oid get_transform_fromsql(Oid typid, Oid langid, List *trftypes);
extern Oid get_transform_tosql(Oid typid, Oid langid, List *trftypes);
extern bool get_typisdefined(Oid typid);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d7f33ab..8c90fd7 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -16,6 +16,7 @@
#include "access/tupdesc.h"
#include "access/xlog.h"
+#include "catalog/catalog.h"
#include "catalog/pg_class.h"
#include "catalog/pg_index.h"
#include "catalog/pg_publication.h"
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 6019f37..719837d 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -2000,6 +2000,33 @@ sub pg_recvlogical_upto
=pod
+=item $node->create_logical_slot_on_standby(self, master, slot_name, dbname)
+
+Create logical replication slot on given standby
+
+=cut
+
+sub create_logical_slot_on_standby
+{
+ my ($self, $master, $slot_name, $dbname) = @_;
+ my ($stdout, $stderr);
+
+ my $handle;
+
+ $handle = IPC::Run::start(['pg_recvlogical', '-d', $self->connstr($dbname), '-P', 'test_decoding', '-S', $slot_name, '--create-slot'], '>', \$stdout, '2>', \$stderr);
+ sleep(1);
+
+ # Slot creation on standby waits for an xl_running_xacts record. So arrange
+ # for it.
+ $master->safe_psql('postgres', 'CHECKPOINT');
+
+ $handle->finish();
+
+ return 0;
+}
+
+=pod
+
=back
=cut
diff --git a/src/test/recovery/t/018_logical_decoding_on_replica.pl b/src/test/recovery/t/018_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..304f32a
--- /dev/null
+++ b/src/test/recovery/t/018_logical_decoding_on_replica.pl
@@ -0,0 +1,395 @@
+# Test logical decoding on a standby.
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 51;
+use RecursiveCopy;
+use File::Copy;
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--slot=decoding_standby');
+
+# Fetch xmin columns from slot's pg_replication_slots row, after waiting for
+# given boolean condition to be true to ensure we've reached a quiescent state
+sub wait_for_phys_mins
+{
+ my ($node, $slotname, $check_expr) = @_;
+
+ $node->poll_query_until(
+ 'postgres', qq[
+ SELECT $check_expr
+ FROM pg_catalog.pg_replication_slots
+ WHERE slot_name = '$slotname';
+ ]) or die "Timed out waiting for slot xmins to advance";
+
+ my $slotinfo = $node->slot($slotname);
+ return ($slotinfo->{'xmin'}, $slotinfo->{'catalog_xmin'});
+}
+
+sub print_phys_xmin
+{
+ my $slot = $node_master->slot('decoding_standby');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# After slot creation, xmins must be null
+is($xmin, '', "xmin null");
+is($catalog_xmin, '', "catalog_xmin null");
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+ $node_master, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_replica->append_conf('postgresql.conf',
+ q[primary_slot_name = 'decoding_standby']);
+
+$node_replica->start;
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# with hot_standby_feedback off, xmin and catalog_xmin must still be null
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "xmin null after replica join");
+is($catalog_xmin, '', "catalog_xmin null after replica join");
+
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. With hot_standby_feedback on, xmin should advance,
+# but catalog_xmin should still remain NULL since there is no logical slot.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NULL");
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+#
+# This must succeed since we know we have a catalog_xmin reservation. We
+# might've already sent hot standby feedback to advance our physical slot's
+# catalog_xmin but not received the corresponding xlog for the catalog xmin
+# advance, in which case we'll create a slot that isn't usable. The calling
+# application can prevent this by creating a temporary slot on the master to
+# lock in its catalog_xmin. For a truly race-free solution we'd need
+# master-to-standby hot_standby_feedback replies.
+#
+# In this case it won't race because there's no concurrent activity on the
+# master.
+#
+is($node_replica->create_logical_slot_on_standby($node_master, 'standby_logical', 'testdb'),
+ 0, 'logical slot creation on standby succeeded')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+sub print_logical_xmin
+{
+ my $slot = $node_replica->slot('standby_logical');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
+$node_master->safe_psql('testdb', 'DROP TABLE test_table');
+$node_master->safe_psql('testdb', 'VACUUM');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+# Should show the inserts even when the table is dropped on master
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($stderr, '', 'stderr is empty');
+is($ret, 0, 'replay from slot succeeded')
+ or BAIL_OUT('cannot continue if slot replay fails');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_1();]);
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+for my $i (0 .. 2000)
+{
+ $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_2();]);
+$node_master->safe_psql('testdb', 'VACUUM');
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+cmp_ok($new_logical_catalog_xmin, "==", $logical_catalog_xmin, "logical slot catalog_xmin hasn't advanced before get_changes");
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+isnt($stdout, '', 'replayed some rows');
+
+($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+is($new_logical_xmin, '', "logical xmin null");
+isnt($new_logical_catalog_xmin, '', "logical slot catalog_xmin not null");
+cmp_ok($new_logical_catalog_xmin, ">", $logical_catalog_xmin, "logical slot catalog_xmin advanced after get_changes");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+sleep(2); # ensure walreceiver feedback sent
+
+my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+isnt($new_physical_xmin, '', "physical xmin not null");
+# hot standby feedback should advance phys catalog_xmin now that the standby's
+# slot doesn't hold it down as far.
+isnt($new_physical_catalog_xmin, '', "physical catalog_xmin not null");
+cmp_ok($new_physical_catalog_xmin, ">", $physical_catalog_xmin, "physical catalog_xmin advanced");
+
+cmp_ok($new_physical_catalog_xmin, "<=", $new_logical_catalog_xmin, 'upstream physical slot catalog_xmin not past downstream catalog_xmin with hs_feedback on');
+
+#########################################################
+# Upstream oldestXid retention
+#########################################################
+
+sub test_oldest_xid_retention()
+{
+ # First burn some xids on the master in another DB, so we push the master's
+ # nextXid ahead.
+ foreach my $i (1 .. 100)
+ {
+ $node_master->safe_psql('postgres', 'SELECT txid_current()');
+ }
+
+ # Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
+ # past our needed xmin. The only way we have visibility into that is to force
+ # a checkpoint.
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
+ foreach my $dbname ('template1', 'postgres', 'testdb', 'template0')
+ {
+ $node_master->safe_psql($dbname, 'VACUUM FREEZE');
+ }
+ sleep(1);
+ $node_master->safe_psql('postgres', 'CHECKPOINT');
+ IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
+ or die "pg_controldata failed with $?";
+ my @checkpoint = split('\n', $stdout);
+ my ($oldestXid, $nextXid) = ('', '');
+ foreach my $line (@checkpoint)
+ {
+ if ($line =~ qr/^Latest checkpoint's NextXID:\s+\d+:(\d+)/)
+ {
+ $nextXid = $1;
+ }
+ if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
+ {
+ $oldestXid = $1;
+ }
+ }
+ die 'no oldestXID found in checkpoint' unless $oldestXid;
+
+ my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+ my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+
+ print "upstream oldestXid $oldestXid, nextXid $nextXid, phys slot catalog_xmin $new_physical_catalog_xmin, downstream catalog_xmin $new_logical_catalog_xmin";
+
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
+
+ return ($oldestXid);
+}
+
+my ($oldestXid) = test_oldest_xid_retention();
+
+cmp_ok($oldestXid, "<=", $new_logical_catalog_xmin, 'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
+
+########################################################################
+# Recovery conflict: conflicting replication slot should get dropped
+########################################################################
+
+# One way to reproduce a recovery conflict is to run VACUUM FULL with
+# hot_standby_feedback turned off on the standby.
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = off
+]);
+$node_replica->restart;
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. Both should be NULL since hs_feedback is off
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NULL AND catalog_xmin IS NULL");
+$node_master->safe_psql('testdb', 'VACUUM FULL');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+isnt($ret, 0, 'usage of slot failed as expected');
+like($stderr, qr/does not exist/, 'slot not found as expected');
+
+# Re-create the slot now that we know it is dropped
+is($node_replica->create_logical_slot_on_standby($node_master, 'standby_logical', 'testdb'),
+ 0, 'logical slot creation on standby succeeded')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+# Set hot_standby_feedback back on
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. Both should be non-NULL since hs_feedback is on and
+# there is a logical slot present on standby.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NOT NULL");
+
+##################################################
+# Drop slot
+##################################################
+#
+is($node_replica->safe_psql('postgres', 'SHOW hot_standby_feedback'), 'on', 'hs_feedback is on');
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+
+# Make sure slots on replicas are droppable, and properly clear the upstream's xmin
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
+
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. catalog_xmin should become NULL because we dropped
+# the logical slot.
+($xmin, $catalog_xmin) = wait_for_phys_mins($node_master, 'decoding_standby',
+ "xmin IS NOT NULL AND catalog_xmin IS NULL");
+
+##################################################
+# Recovery: drop database drops idle slots
+##################################################
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB on the upstream if they're on the right DB, or not dropped if on
+# another DB.
+
+is($node_replica->create_logical_slot_on_standby($node_master, 'dodropslot', 'testdb'),
+ 0, 'created dodropslot on testdb')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+is($node_replica->create_logical_slot_on_standby($node_master, 'otherslot', 'postgres'),
+ 0, 'created otherslot on postgres')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, 'logical', 'slot dodropslot on standby created');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'slot otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+ 'database dropped on standby');
+
+is($node_replica->slot('dodropslot')->{'slot_type'}, '', 'slot dodropslot on standby dropped');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby not dropped');
+
+
+##################################################
+# Recovery: drop database drops in-use slots
+##################################################
+
+# This time, have the slot in-use on the downstream DB when we drop it.
+print "Testing dropdb when downstream slot is in-use";
+$node_master->psql('postgres', q[CREATE DATABASE testdb2]);
+
+print "creating slot dodropslot2";
+$node_replica->command_ok(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-P', 'test_decoding', '-S', 'dodropslot2', '--create-slot'],
+ 'pg_recvlogical created slot dodropslot2');
+is($node_replica->slot('dodropslot2')->{'slot_type'}, 'logical', 'slot dodropslot2 on standby created');
+
+# make sure the slot is in use
+print "starting pg_recvlogical";
+$handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb2'), '-S', 'dodropslot2', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+sleep(1);
+
+is($node_replica->slot('dodropslot2')->{'active'}, 't', 'slot on standby is active')
+ or BAIL_OUT("slot not active on standby, cannot continue. pg_recvlogical exited with '$stdout', '$stderr'");
+
+# Master doesn't know the replica's slot is busy so dropdb should succeed
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb2]);
+ok(1, 'dropdb finished');
+
+while ($node_replica->slot('dodropslot2')->{'active_pid'})
+{
+ sleep(1);
+ print "waiting for walsender to exit";
+}
+
+print "walsender exited, waiting for pg_recvlogical to exit";
+
+# our client should've terminated in response to the walsender error
+eval {
+ $handle->finish;
+};
+$return = $?;
+if ($return) {
+ is($return, 256, "pg_recvlogical terminated by server");
+ like($stderr, qr/terminating connection due to conflict with recovery/, 'recvlogical recovery conflict');
+ like($stderr, qr/User was connected to a database that must be dropped./, 'recvlogical recovery conflict db');
+}
+
+is($node_replica->slot('dodropslot2')->{'active_pid'}, '', 'walsender backend exited');
+
+# The slot should be dropped by recovery now
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres', q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb2')]), 'f',
+ 'database dropped on standby');
+
+is($node_replica->slot('dodropslot2')->{'slot_type'}, '', 'slot on standby dropped');
--
2.1.4
On Fri, Jun 21, 2019 at 11:50 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
This definitely needs to be expanded, and follow the message style
guideline. This message, with the v8 patch, looks like this:
ereport(LOG,
(errmsg("Dropping conflicting slot %s", NameStr(slotname)),
errdetail("%s", reason)));
where reason is a char string.
That does not follow the message style guideline.
https://www.postgresql.org/docs/12/error-style-guide.html
From the grammar and punctuation section:
"Primary error messages: Do not capitalize the first letter. Do not
end a message with a period. Do not even think about ending a message
with an exclamation point.
Detail and hint messages: Use complete sentences, and end each with a
period. Capitalize the first word of sentences. Put two spaces after
the period if another sentence follows (for English text; might be
inappropriate in other languages)."
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, 25 Jun 2019 at 19:14, Robert Haas <robertmhaas@gmail.com> wrote:
Thanks. In the updated patch, I have changed the message style. Now it
looks like this:
primary message : dropped conflicting slot slot_name
error detail : Slot conflicted with xid horizon which was being
increased to 9012 (slot xmin: 1234, slot catalog_xmin: 5678).
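For reference, the corresponding code in the attached v11 patch (in
ResolveRecoveryConflictWithLogicalSlots() in slot.c) emits it as:
ereport(LOG,
(errmsg("dropped conflicting slot %s", NameStr(slotname)),
errdetail("%s", conflict_reason)));
where conflict_reason holds the detail sentence built from the slot's
xmin/catalog_xmin values.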
--------------------
Also, in the updated patch (v11), I have added some scenarios that
verify that the slot is dropped when either the master's wal_level is
insufficient or the slot is conflicting. I have also organized the
test file a bit.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
logical-decoding-on-standby_v11.patch
From aa3004a70e1ab2ee304367b29dde1549326354f1 Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Mon, 1 Jul 2019 10:49:50 +0530
Subject: [PATCH] Logical decoding on standby - v11
Author : Andres Freund.
Besides the above main changes, the patch includes the following:
1. Handle slot conflict recovery by dropping the conflicting slots.
-Amit Khandekar.
2. test/recovery/t/018_logical_decoding_on_replica.pl added.
Original author : Craig Ringer. A few changes/additions from Amit Khandekar.
3. Handle slot conflicts when master wal_level becomes less than logical.
Changes in v6 patch :
While creating the slot, lastReplayedEndRecPtr is used to set the
restart_lsn, but its position is later adjusted in
DecodingContextFindStartpoint() in case it does not point to a
valid record location. This can happen because replay pointer
points to 1 + end of last record replayed, which means it can
coincide with first byte of a new WAL block, i.e. inside block
header.
Also, modified the test to handle the requirement that the
logical slot creation on standby requires a checkpoint
(or any other transaction commit) to be given from master. For
that, in src/test/perl/PostgresNode.pm, added a new function
create_logical_slot_on_standby() which does the required steps.
Changes in v7 patch :
Merge the two conflict messages for xmin and catalog_xmin into
a single one.
Changes in v8 :
Fix incorrect flush ptr on standby (reported by Tushar Ahuja).
In XLogSendLogical(), GetFlushRecPtr() was used to get the flushed
point. On standby, GetFlushRecPtr() does not give a valid value, so it
was wrongly determined that the sent record is beyond flush point, as
a result of which, WalSndCaughtUp was set to true, causing
WalSndLoop() to sleep for some duration after every record.
The symptom reported by Tushar Ahuja was that pg_recvlogical appears
to hang when there are lots of inserts.
Fix: Use GetStandbyFlushRecPtr() if am_cascading_walsender
Changes in v9 :
While dropping a conflicting logical slot, if a backend has acquired it, send
it a recovery conflict signal. See the new function ReplicationSlotDropConflicting().
Also, miscellaneous review comments addressed, but not all of them yet.
Changes in v10 :
Adjust restart_lsn if it's a Replay Pointer.
This was earlier done in DecodingContextFindStartpoint() but now it
is done in ReplicationSlotReserveWal(), when restart_lsn is initialized.
Changes in v11 :
Added some test scenarios to test drop-slot conflicts. Organized the
test file a bit.
Also improved the conflict error message.
---
src/backend/access/gist/gistxlog.c | 6 +-
src/backend/access/hash/hash_xlog.c | 3 +-
src/backend/access/hash/hashinsert.c | 2 +
src/backend/access/heap/heapam.c | 23 +-
src/backend/access/heap/vacuumlazy.c | 2 +-
src/backend/access/heap/visibilitymap.c | 2 +-
src/backend/access/nbtree/nbtpage.c | 4 +
src/backend/access/nbtree/nbtxlog.c | 4 +-
src/backend/access/spgist/spgvacuum.c | 2 +
src/backend/access/spgist/spgxlog.c | 1 +
src/backend/access/transam/xlog.c | 22 ++
src/backend/postmaster/pgstat.c | 4 +
src/backend/replication/logical/decode.c | 14 +-
src/backend/replication/logical/logical.c | 33 +-
src/backend/replication/slot.c | 233 +++++++++++-
src/backend/replication/walsender.c | 8 +-
src/backend/storage/ipc/procarray.c | 4 +
src/backend/storage/ipc/procsignal.c | 3 +
src/backend/storage/ipc/standby.c | 7 +-
src/backend/tcop/postgres.c | 23 +-
src/backend/utils/adt/pgstatfuncs.c | 1 +
src/backend/utils/cache/lsyscache.c | 16 +
src/include/access/gistxlog.h | 3 +-
src/include/access/hash_xlog.h | 1 +
src/include/access/heapam_xlog.h | 8 +-
src/include/access/nbtxlog.h | 2 +
src/include/access/spgxlog.h | 1 +
src/include/access/xlog.h | 1 +
src/include/pgstat.h | 1 +
src/include/replication/slot.h | 2 +
src/include/storage/procsignal.h | 1 +
src/include/storage/standby.h | 2 +-
src/include/utils/lsyscache.h | 1 +
src/include/utils/rel.h | 1 +
src/test/perl/PostgresNode.pm | 27 ++
.../recovery/t/018_logical_decoding_on_replica.pl | 420 +++++++++++++++++++++
36 files changed, 830 insertions(+), 58 deletions(-)
create mode 100644 src/test/recovery/t/018_logical_decoding_on_replica.pl
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 503db34..385ea1f 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -195,7 +195,8 @@ gistRedoDeleteRecord(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
@@ -397,7 +398,7 @@ gistRedoPageReuse(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
- xlrec->node);
+ xlrec->onCatalogTable, xlrec->node);
}
}
@@ -589,6 +590,7 @@ gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXi
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index d7b7098..00c3e0f 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -1002,7 +1002,8 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
RelFileNode rnode;
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer);
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 5321762..e28465a 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
#include "access/hash.h"
#include "access/hash_xlog.h"
+#include "catalog/catalog.h"
#include "miscadmin.h"
#include "utils/rel.h"
#include "storage/lwlock.h"
@@ -398,6 +399,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
xl_hash_vacuum_one_page xlrec;
XLogRecPtr recptr;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(hrel);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.ntuples = ndeletable;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d768b9b..10b7857 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7149,12 +7149,13 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
* see comments for vacuum_log_cleanup_info().
*/
XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
+log_heap_cleanup_info(Relation rel, TransactionId latestRemovedXid)
{
xl_heap_cleanup_info xlrec;
XLogRecPtr recptr;
- xlrec.node = rnode;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
+ xlrec.node = rel->rd_node;
xlrec.latestRemovedXid = latestRemovedXid;
XLogBeginInsert();
@@ -7190,6 +7191,7 @@ log_heap_clean(Relation reln, Buffer buffer,
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
xlrec.ndead = ndead;
@@ -7240,6 +7242,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.cutoff_xid = cutoff_xid;
xlrec.ntuples = ntuples;
@@ -7270,7 +7273,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
* heap_buffer, if necessary.
*/
XLogRecPtr
-log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
+log_heap_visible(Relation rel, Buffer heap_buffer, Buffer vm_buffer,
TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
@@ -7280,6 +7283,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(heap_buffer));
Assert(BufferIsValid(vm_buffer));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec.cutoff_xid = cutoff_xid;
xlrec.flags = vmflags;
XLogBeginInsert();
@@ -7700,7 +7704,8 @@ heap_xlog_cleanup_info(XLogReaderState *record)
xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, xlrec->node);
/*
* Actual operation is a no-op. Record type exists to provide a means for
@@ -7736,7 +7741,8 @@ heap_xlog_clean(XLogReaderState *record)
* latestRemovedXid is invalid, skip conflict processing.
*/
if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
/*
* If we have a full-page image, restore it (using a cleanup lock) and
@@ -7832,7 +7838,9 @@ heap_xlog_visible(XLogReaderState *record)
* rather than killing the transaction outright.
*/
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid,
+ xlrec->onCatalogTable,
+ rnode);
/*
* Read the heap page, if it still exists. If the heap file has dropped or
@@ -7969,7 +7977,8 @@ heap_xlog_freeze_page(XLogReaderState *record)
TransactionIdRetreat(latestRemovedXid);
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a3c4a1d..bf34d3a 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -473,7 +473,7 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
* No need to write the record at all unless it contains a valid value
*/
if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
- (void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+ (void) log_heap_cleanup_info(rel, vacrelstats->latestRemovedXid);
}
/*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06..c5fdd64 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -281,7 +281,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
if (XLogRecPtrIsInvalid(recptr))
{
Assert(!InRecovery);
- recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
+ recptr = log_heap_visible(rel, heapBuf, vmBuf,
cutoff_xid, flags);
/*
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 0357030..6b641c9 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -31,6 +31,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/lsyscache.h"
#include "utils/snapmgr.h"
static void _bt_cachemetadata(Relation rel, BTMetaPageData *input);
@@ -773,6 +774,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
@@ -1140,6 +1142,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
XLogRecPtr recptr;
xl_btree_delete xlrec_delete;
+ xlrec_delete.onCatalogTable =
+ RelationIsAccessibleInLogicalDecoding(heapRel);
xlrec_delete.latestRemovedXid = latestRemovedXid;
xlrec_delete.nitems = nitems;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 6532a25..b874bda 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -526,7 +526,8 @@ btree_xlog_delete(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
/*
@@ -810,6 +811,7 @@ btree_xlog_reuse_page(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable,
xlrec->node);
}
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 2b1662a..eaaf631 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -27,6 +27,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
+#include "utils/lsyscache.h"
/* Entry in pending-list of TIDs we need to revisit */
@@ -502,6 +503,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
OffsetNumber itemnos[MaxIndexTuplesPerPage];
spgxlogVacuumRedirect xlrec;
+ xlrec.onCatalogTable = get_rel_logical_catalog(index->rd_index->indrelid);
xlrec.nToPlaceholder = 0;
xlrec.newestRedirectXid = InvalidTransactionId;
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index ebe6ae8..800609c 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -881,6 +881,7 @@ spgRedoVacuumRedirect(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &node, NULL, NULL);
ResolveRecoveryConflictWithSnapshot(xldata->newestRedirectXid,
+ xldata->onCatalogTable,
node);
}
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e08320e..7417bcf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4926,6 +4926,15 @@ LocalProcessControlFile(bool reset)
}
/*
+ * Get the wal_level from the control file.
+ */
+WalLevel
+GetActiveWalLevel(void)
+{
+ return ControlFile->wal_level;
+}
+
+/*
* Initialization of shared memory for XLOG
*/
Size
@@ -9843,6 +9852,19 @@ xlog_redo(XLogReaderState *record)
/* Update our copy of the parameters in pg_control */
memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
+ /*
+ * Drop logical slots if we are in hot standby and master does not have
+ * logical data. Don't bother to search for the slots if standby is
+ * running with wal_level lower than logical, because in that case,
+ * we would have either disallowed creation of logical slots or dropped
+ * existing ones.
+ */
+ if (InRecovery && InHotStandby &&
+ xlrec.wal_level < WAL_LEVEL_LOGICAL &&
+ wal_level >= WAL_LEVEL_LOGICAL)
+ ResolveRecoveryConflictWithLogicalSlots(InvalidOid, InvalidTransactionId,
+ gettext_noop("Logical decoding on standby requires wal_level >= logical on master."));
+
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->MaxConnections = xlrec.MaxConnections;
ControlFile->max_worker_processes = xlrec.max_worker_processes;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b4f2b28..797ea0c 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4728,6 +4728,7 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
dbentry->n_conflict_tablespace = 0;
dbentry->n_conflict_lock = 0;
dbentry->n_conflict_snapshot = 0;
+ dbentry->n_conflict_logicalslot = 0;
dbentry->n_conflict_bufferpin = 0;
dbentry->n_conflict_startup_deadlock = 0;
dbentry->n_temp_files = 0;
@@ -6352,6 +6353,9 @@ pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
dbentry->n_conflict_snapshot++;
break;
+ case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
+ dbentry->n_conflict_logicalslot++;
+ break;
case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
dbentry->n_conflict_bufferpin++;
break;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 151c3ef..c1bd028 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
* can restart from there.
*/
break;
+ case XLOG_PARAMETER_CHANGE:
+ {
+ xl_parameter_change *xlrec =
+ (xl_parameter_change *) XLogRecGetData(buf->record);
+
+ /* Cannot proceed if master itself does not have logical data */
+ if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ break;
+ }
case XLOG_NOOP:
case XLOG_NEXTOID:
case XLOG_SWITCH:
case XLOG_BACKUP_END:
- case XLOG_PARAMETER_CHANGE:
case XLOG_RESTORE_POINT:
case XLOG_FPW_CHANGE:
case XLOG_FPI_FOR_HINT:
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index bbd38c0..4169828 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -94,23 +94,22 @@ CheckLogicalDecodingRequirements(void)
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires a database connection")));
- /* ----
- * TODO: We got to change that someday soon...
- *
- * There's basically three things missing to allow this:
- * 1) We need to be able to correctly and quickly identify the timeline a
- * LSN belongs to
- * 2) We need to force hot_standby_feedback to be enabled at all times so
- * the primary cannot remove rows we need.
- * 3) support dropping replication slots referring to a database, in
- * dbase_redo. There can't be any active ones due to HS recovery
- * conflicts, so that should be relatively easy.
- * ----
- */
if (RecoveryInProgress())
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("logical decoding cannot be used while in recovery")));
+ {
+ /*
+ * This check may have race conditions, but whenever
+ * XLOG_PARAMETER_CHANGE indicates that wal_level has changed, we
+ * verify that there are no existing logical replication slots. And to
+ * avoid races around creating a new slot,
+ * CheckLogicalDecodingRequirements() is called once before creating
+ * the slot, and once when logical decoding is initially starting up.
+ */
+ if (GetActiveWalLevel() < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ }
}
/*
@@ -241,6 +240,8 @@ CreateInitDecodingContext(char *plugin,
LogicalDecodingContext *ctx;
MemoryContext old_context;
+ CheckLogicalDecodingRequirements();
+
/* shorter lines... */
slot = MyReplicationSlot;
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 55c306e..47c7dd8 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -46,6 +46,7 @@
#include "pgstat.h"
#include "replication/slot.h"
#include "storage/fd.h"
+#include "storage/lock.h"
#include "storage/proc.h"
#include "storage/procarray.h"
#include "utils/builtins.h"
@@ -101,6 +102,7 @@ int max_replication_slots = 0; /* the maximum number of replication
static void ReplicationSlotDropAcquired(void);
static void ReplicationSlotDropPtr(ReplicationSlot *slot);
+static void ReplicationSlotDropConflicting(ReplicationSlot *slot);
/* internal persistency functions */
static void RestoreSlotFromDisk(const char *name);
@@ -638,6 +640,64 @@ ReplicationSlotDropPtr(ReplicationSlot *slot)
}
/*
+ * Permanently drop a conflicting replication slot. If it's already active by
+ * another backend, send it a recovery conflict signal, and then try again.
+ */
+static void
+ReplicationSlotDropConflicting(ReplicationSlot *slot)
+{
+ pid_t active_pid;
+ PGPROC *proc;
+ VirtualTransactionId vxid;
+
+ ConditionVariablePrepareToSleep(&slot->active_cv);
+ while (1)
+ {
+ SpinLockAcquire(&slot->mutex);
+ active_pid = slot->active_pid;
+ if (active_pid == 0)
+ active_pid = slot->active_pid = MyProcPid;
+ SpinLockRelease(&slot->mutex);
+
+ /* Drop the acquired slot, unless it is acquired by another backend */
+ if (active_pid == MyProcPid)
+ {
+ elog(DEBUG1, "acquired conflicting slot, now dropping it");
+ ReplicationSlotDropPtr(slot);
+ break;
+ }
+
+ /* Send the other backend a recovery conflict signal */
+
+ SetInvalidVirtualTransactionId(vxid);
+ LWLockAcquire(ProcArrayLock, LW_SHARED);
+ proc = BackendPidGetProcWithLock(active_pid);
+ if (proc)
+ GET_VXID_FROM_PGPROC(vxid, *proc);
+ LWLockRelease(ProcArrayLock);
+
+ /*
+ * If coincidently that process finished, some other backend may
+ * If, coincidentally, that process finished, some other backend may
+ * Note: Even if vxid.localTransactionId is invalid, we need to cancel
+ * that backend, because there is no other way to make it release the
+ * slot. So don't bother to validate vxid.localTransactionId.
+ */
+ if (vxid.backendId == InvalidBackendId)
+ continue;
+
+ elog(DEBUG1, "cancelling pid %d (backendId: %d) for releasing slot",
+ active_pid, vxid.backendId);
+
+ CancelVirtualTransaction(vxid, PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT);
+ ConditionVariableSleep(&slot->active_cv,
+ WAIT_EVENT_REPLICATION_SLOT_DROP);
+ }
+
+ ConditionVariableCancelSleep();
+}
+
+/*
* Serialize the currently acquired slot's state from memory to disk, thereby
* guaranteeing the current state will survive a crash.
*/
@@ -1016,37 +1076,56 @@ ReplicationSlotReserveWal(void)
/*
* For logical slots log a standby snapshot and start logical decoding
* at exactly that position. That allows the slot to start up more
- * quickly.
+ * quickly. But on a standby we cannot do WAL writes, so just use the
+ * replay pointer; effectively, an attempt to create a logical slot on
+ * standby will cause it to wait for an xl_running_xacts record to be
+ * logged independently on the primary, so that a snapshot can be built
+ * using the record.
*
- * That's not needed (or indeed helpful) for physical slots as they'll
- * start replay at the last logged checkpoint anyway. Instead return
- * the location of the last redo LSN. While that slightly increases
- * the chance that we have to retry, it's where a base backup has to
- * start replay at.
+ * None of this is needed (or indeed helpful) for physical slots as
+ * they'll start replay at the last logged checkpoint anyway. Instead
+ * return the location of the last redo LSN. While that slightly
+ * increases the chance that we have to retry, it's where a base backup
+ * has to start replay at.
*/
+ if (SlotIsPhysical(slot))
+ restart_lsn = GetRedoRecPtr();
+ else if (RecoveryInProgress())
+ {
+ restart_lsn = GetXLogReplayRecPtr(NULL);
+ /*
+ * Replay pointer may point one past the end of the record. If that
+ * is a XLOG page boundary, it will not be a valid LSN for the
+ * start of a record, so bump it up past the page header.
+ */
+ if (!XRecOffIsValid(restart_lsn))
+ {
+ if (restart_lsn % XLOG_BLCKSZ != 0)
+ elog(ERROR, "invalid replay pointer");
+ /* For the first page of a segment file, it's a long header */
+ if (XLogSegmentOffset(restart_lsn, wal_segment_size) == 0)
+ restart_lsn += SizeOfXLogLongPHD;
+ else
+ restart_lsn += SizeOfXLogShortPHD;
+ }
+ }
+ else
+ restart_lsn = GetXLogInsertRecPtr();
+
+ SpinLockAcquire(&slot->mutex);
+ slot->data.restart_lsn = restart_lsn;
+ SpinLockRelease(&slot->mutex);
+
if (!RecoveryInProgress() && SlotIsLogical(slot))
{
XLogRecPtr flushptr;
- /* start at current insert position */
- restart_lsn = GetXLogInsertRecPtr();
- SpinLockAcquire(&slot->mutex);
- slot->data.restart_lsn = restart_lsn;
- SpinLockRelease(&slot->mutex);
-
/* make sure we have enough information to start */
flushptr = LogStandbySnapshot();
/* and make sure it's fsynced to disk */
XLogFlush(flushptr);
}
- else
- {
- restart_lsn = GetRedoRecPtr();
- SpinLockAcquire(&slot->mutex);
- slot->data.restart_lsn = restart_lsn;
- SpinLockRelease(&slot->mutex);
- }
/* prevent WAL removal as fast as possible */
ReplicationSlotsComputeRequiredLSN();
@@ -1065,6 +1144,122 @@ ReplicationSlotReserveWal(void)
}
/*
+ * Resolve recovery conflicts with logical slots.
+ *
+ * When xid is valid, it means that rows older than xid might have been
+ * removed. Therefore we need to drop slots that depend on seeing those rows.
+ * When xid is invalid, drop all logical slots. This is required when the
+ * master wal_level is set back to replica, so existing logical slots need to
+ * be dropped. Also, when xid is invalid, a common 'conflict_reason' is
+ * provided for the error detail; otherwise it is NULL, in which case it is
+ * constructed out of the xid value.
+ */
+void
+ResolveRecoveryConflictWithLogicalSlots(Oid dboid, TransactionId xid,
+ char *conflict_reason)
+{
+ int i;
+ bool found_conflict = false;
+
+ if (max_replication_slots <= 0)
+ return;
+
+restart:
+ if (found_conflict)
+ {
+ CHECK_FOR_INTERRUPTS();
+ found_conflict = false;
+ }
+
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationSlot *s;
+
+ s = &ReplicationSlotCtl->replication_slots[i];
+
+ /* cannot change while ReplicationSlotCtlLock is held */
+ if (!s->in_use)
+ continue;
+
+ /* We are only dealing with *logical* slot conflicts. */
+ if (!SlotIsLogical(s))
+ continue;
+
+ /* Invalid xid means caller is asking to drop all logical slots */
+ if (!TransactionIdIsValid(xid))
+ found_conflict = true;
+ else
+ {
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;
+ StringInfoData conflict_str, conflict_xmins;
+ char *conflict_sentence =
+ gettext_noop("Slot conflicted with xid horizon which was being increased to");
+
+ /* not our database, skip */
+ if (s->data.database != InvalidOid && s->data.database != dboid)
+ continue;
+
+ SpinLockAcquire(&s->mutex);
+ slot_xmin = s->data.xmin;
+ slot_catalog_xmin = s->data.catalog_xmin;
+ SpinLockRelease(&s->mutex);
+
+ /*
+ * Build the conflict_str which will look like :
+ * "Slot conflicted with xid horizon which was being increased
+ * to 9012 (slot xmin: 1234, slot catalog_xmin: 5678)."
+ */
+ initStringInfo(&conflict_xmins);
+ if (TransactionIdIsValid(slot_xmin) &&
+ TransactionIdPrecedesOrEquals(slot_xmin, xid))
+ {
+ appendStringInfo(&conflict_xmins, "slot xmin: %u", slot_xmin);
+ }
+ if (TransactionIdIsValid(slot_catalog_xmin) &&
+ TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
+ appendStringInfo(&conflict_xmins, "%sslot catalog_xmin: %u",
+ conflict_xmins.len > 0 ? ", " : "",
+ slot_catalog_xmin);
+
+ if (conflict_xmins.len > 0)
+ {
+ initStringInfo(&conflict_str);
+ appendStringInfo(&conflict_str, "%s %u (%s).",
+ conflict_sentence, xid, conflict_xmins.data);
+ found_conflict = true;
+ conflict_reason = conflict_str.data;
+ }
+ }
+
+ if (found_conflict)
+ {
+ NameData slotname;
+
+ SpinLockAcquire(&s->mutex);
+ slotname = s->data.name;
+ SpinLockRelease(&s->mutex);
+
+ /* ReplicationSlotDropPtr() would acquire the lock below */
+ LWLockRelease(ReplicationSlotControlLock);
+
+ ReplicationSlotDropConflicting(s);
+
+ ereport(LOG,
+ (errmsg("dropped conflicting slot %s", NameStr(slotname)),
+ errdetail("%s", conflict_reason)));
+
+ /* We released the lock above; so re-scan the slots. */
+ goto restart;
+ }
+ }
+
+ LWLockRelease(ReplicationSlotControlLock);
+}
+
+
+/*
* Flush all replication slots to disk.
*
* This needn't actually be part of a checkpoint, but it's a convenient
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 92fa86f..4ce7096 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2814,6 +2814,7 @@ XLogSendLogical(void)
{
XLogRecord *record;
char *errm;
+ XLogRecPtr flushPtr;
/*
* Don't know whether we've caught up yet. We'll set WalSndCaughtUp to
@@ -2830,10 +2831,11 @@ XLogSendLogical(void)
if (errm != NULL)
elog(ERROR, "%s", errm);
+ flushPtr = (am_cascading_walsender ?
+ GetStandbyFlushRecPtr() : GetFlushRecPtr());
+
if (record != NULL)
{
- /* XXX: Note that logical decoding cannot be used while in recovery */
- XLogRecPtr flushPtr = GetFlushRecPtr();
/*
* Note the lack of any call to LagTrackerWrite() which is handled by
@@ -2857,7 +2859,7 @@ XLogSendLogical(void)
* If the record we just wanted read is at or beyond the flushed
* point, then we're caught up.
*/
- if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
+ if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
{
WalSndCaughtUp = true;
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 18a0f62..ec696f4 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2669,6 +2669,10 @@ CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
GET_VXID_FROM_PGPROC(procvxid, *proc);
+ /*
+ * Note: vxid.localTransactionId can be invalid, which means the
+ * request is to signal the pid that is not running a transaction.
+ */
if (procvxid.backendId == vxid.backendId &&
procvxid.localTransactionId == vxid.localTransactionId)
{
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 7605b2c..645f320 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -286,6 +286,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT))
RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+ if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT))
+ RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT);
+
if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK))
RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 25b7e31..7cfb6d5 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -23,6 +23,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/slot.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -291,7 +292,8 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
}
void
-ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode node)
+ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
+ bool onCatalogTable, RelFileNode node)
{
VirtualTransactionId *backends;
@@ -312,6 +314,9 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode
ResolveRecoveryConflictWithVirtualXIDs(backends,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+
+ if (onCatalogTable)
+ ResolveRecoveryConflictWithLogicalSlots(node.dbNode, latestRemovedXid, NULL);
}
void
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 44a59e1..c23d361 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2393,6 +2393,9 @@ errdetail_recovery_conflict(void)
case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
errdetail("User query might have needed to see row versions that must be removed.");
break;
+ case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
+ errdetail("User was using the logical slot that must be dropped.");
+ break;
case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
errdetail("User transaction caused buffer deadlock with recovery.");
break;
@@ -2879,6 +2882,25 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
case PROCSIG_RECOVERY_CONFLICT_LOCK:
case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+ case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
+ /*
+ * For conflicts that require a logical slot to be dropped, the
+ * requirement is for the signal receiver to release the slot,
+ * so that it can be dropped by the signal sender. So for
+ * normal backends, the transaction should be aborted, just
+ * like for other recovery conflicts. But if it's a walsender on a
+ * standby, then it has to be killed so as to release an
+ * acquired logical slot.
+ */
+ if (am_cascading_walsender &&
+ reason == PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT &&
+ MyReplicationSlot && SlotIsLogical(MyReplicationSlot))
+ {
+ RecoveryConflictPending = true;
+ QueryCancelPending = true;
+ InterruptPending = true;
+ break;
+ }
/*
* If we aren't in a transaction any longer then ignore.
@@ -2920,7 +2942,6 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
/* Intentional fall through to session cancel */
/* FALLTHROUGH */
-
case PROCSIG_RECOVERY_CONFLICT_DATABASE:
RecoveryConflictPending = true;
ProcDiePending = true;
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 05240bf..7dfbef7 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1499,6 +1499,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
dbentry->n_conflict_tablespace +
dbentry->n_conflict_lock +
dbentry->n_conflict_snapshot +
+ dbentry->n_conflict_logicalslot +
dbentry->n_conflict_bufferpin +
dbentry->n_conflict_startup_deadlock);
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index c13c08a..bd35bc1 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -18,7 +18,9 @@
#include "access/hash.h"
#include "access/htup_details.h"
#include "access/nbtree.h"
+#include "access/table.h"
#include "bootstrap/bootstrap.h"
+#include "catalog/catalog.h"
#include "catalog/namespace.h"
#include "catalog/pg_am.h"
#include "catalog/pg_amop.h"
@@ -1893,6 +1895,20 @@ get_rel_persistence(Oid relid)
return result;
}
+bool
+get_rel_logical_catalog(Oid relid)
+{
+ bool res;
+ Relation rel;
+
+ /* assume previously locked */
+ rel = heap_open(relid, NoLock);
+ res = RelationIsAccessibleInLogicalDecoding(rel);
+ heap_close(rel, NoLock);
+
+ return res;
+}
+
/* ---------- TRANSFORM CACHE ---------- */
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 969a537..59246c3 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -48,9 +48,9 @@ typedef struct gistxlogPageUpdate
*/
typedef struct gistxlogDelete
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 ntodelete; /* number of deleted offsets */
-
/*
* In payload of blk 0 : todelete OffsetNumbers
*/
@@ -96,6 +96,7 @@ typedef struct gistxlogPageDelete
*/
typedef struct gistxlogPageReuse
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 53b682c..fd70b55 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -263,6 +263,7 @@ typedef struct xl_hash_init_bitmap_page
*/
typedef struct xl_hash_vacuum_one_page
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int ntuples;
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index f6cdca8..a1d1f11 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -237,6 +237,7 @@ typedef struct xl_heap_update
*/
typedef struct xl_heap_clean
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 nredirected;
uint16 ndead;
@@ -252,6 +253,7 @@ typedef struct xl_heap_clean
*/
typedef struct xl_heap_cleanup_info
{
+ bool onCatalogTable;
RelFileNode node;
TransactionId latestRemovedXid;
} xl_heap_cleanup_info;
@@ -332,6 +334,7 @@ typedef struct xl_heap_freeze_tuple
*/
typedef struct xl_heap_freeze_page
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint16 ntuples;
} xl_heap_freeze_page;
@@ -346,6 +349,7 @@ typedef struct xl_heap_freeze_page
*/
typedef struct xl_heap_visible
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint8 flags;
} xl_heap_visible;
@@ -395,7 +399,7 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
extern const char *heap2_identify(uint8 info);
extern void heap_xlog_logical_rewrite(XLogReaderState *r);
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
+extern XLogRecPtr log_heap_cleanup_info(Relation rel,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
@@ -414,7 +418,7 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
bool *totally_frozen);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
-extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
+extern XLogRecPtr log_heap_visible(Relation rel, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc8..f64a33c 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -126,6 +126,7 @@ typedef struct xl_btree_split
*/
typedef struct xl_btree_delete
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int nitems;
@@ -139,6 +140,7 @@ typedef struct xl_btree_delete
*/
typedef struct xl_btree_reuse_page
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/spgxlog.h b/src/include/access/spgxlog.h
index 073f740..d3dad69 100644
--- a/src/include/access/spgxlog.h
+++ b/src/include/access/spgxlog.h
@@ -237,6 +237,7 @@ typedef struct spgxlogVacuumRoot
typedef struct spgxlogVacuumRedirect
{
+ bool onCatalogTable;
uint16 nToPlaceholder; /* number of redirects to make placeholders */
OffsetNumber firstPlaceholder; /* first placeholder tuple to remove */
TransactionId newestRedirectXid; /* newest XID of removed redirects */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 237f4e0..e7439c1 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -299,6 +299,7 @@ extern Size XLOGShmemSize(void);
extern void XLOGShmemInit(void);
extern void BootStrapXLOG(void);
extern void LocalProcessControlFile(bool reset);
+extern WalLevel GetActiveWalLevel(void);
extern void StartupXLOG(void);
extern void ShutdownXLOG(int code, Datum arg);
extern void InitXLOGAccess(void);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0a3ad3a..4fe8684 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -604,6 +604,7 @@ typedef struct PgStat_StatDBEntry
PgStat_Counter n_conflict_tablespace;
PgStat_Counter n_conflict_lock;
PgStat_Counter n_conflict_snapshot;
+ PgStat_Counter n_conflict_logicalslot;
PgStat_Counter n_conflict_bufferpin;
PgStat_Counter n_conflict_startup_deadlock;
PgStat_Counter n_temp_files;
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 8fbddea..73b954e 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -205,4 +205,6 @@ extern void CheckPointReplicationSlots(void);
extern void CheckSlotRequirements(void);
+extern void ResolveRecoveryConflictWithLogicalSlots(Oid dboid, TransactionId xid, char *reason);
+
#endif /* SLOT_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 05b186a..956d3c2 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -39,6 +39,7 @@ typedef enum
PROCSIG_RECOVERY_CONFLICT_TABLESPACE,
PROCSIG_RECOVERY_CONFLICT_LOCK,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
+ PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT,
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index a3f8f82..6dedebc 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -28,7 +28,7 @@ extern void InitRecoveryTransactionEnvironment(void);
extern void ShutdownRecoveryTransactionEnvironment(void);
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
- RelFileNode node);
+ bool onCatalogTable, RelFileNode node);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index c8df5bf..579d9ff 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -131,6 +131,7 @@ extern char get_rel_relkind(Oid relid);
extern bool get_rel_relispartition(Oid relid);
extern Oid get_rel_tablespace(Oid relid);
extern char get_rel_persistence(Oid relid);
+extern bool get_rel_logical_catalog(Oid relid);
extern Oid get_transform_fromsql(Oid typid, Oid langid, List *trftypes);
extern Oid get_transform_tosql(Oid typid, Oid langid, List *trftypes);
extern bool get_typisdefined(Oid typid);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d7f33ab..8c90fd7 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -16,6 +16,7 @@
#include "access/tupdesc.h"
#include "access/xlog.h"
+#include "catalog/catalog.h"
#include "catalog/pg_class.h"
#include "catalog/pg_index.h"
#include "catalog/pg_publication.h"
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 6019f37..719837d 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -2000,6 +2000,33 @@ sub pg_recvlogical_upto
=pod
+=item $node->create_logical_slot_on_standby(self, master, slot_name, dbname)
+
+Create a logical replication slot on the given standby
+
+=cut
+
+sub create_logical_slot_on_standby
+{
+ my ($self, $master, $slot_name, $dbname) = @_;
+ my ($stdout, $stderr);
+
+ my $handle;
+
+ $handle = IPC::Run::start(['pg_recvlogical', '-d', $self->connstr($dbname), '-P', 'test_decoding', '-S', $slot_name, '--create-slot'], '>', \$stdout, '2>', \$stderr);
+ sleep(1);
+
+ # Slot creation on standby waits for an xl_running_xacts record. So arrange
+ # for it.
+ $master->safe_psql('postgres', 'CHECKPOINT');
+
+ $handle->finish();
+
+ return 0;
+}
+
+=pod
+
=back
=cut
diff --git a/src/test/recovery/t/018_logical_decoding_on_replica.pl b/src/test/recovery/t/018_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..fd77e19
--- /dev/null
+++ b/src/test/recovery/t/018_logical_decoding_on_replica.pl
@@ -0,0 +1,420 @@
+# Demonstrate that logical decoding can follow timeline switches.
+#
+# Test logical decoding on a standby.
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 58;
+use RecursiveCopy;
+use File::Copy;
+use Time::HiRes qw(usleep);
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+my $node_master = get_new_node('master');
+my $node_replica = get_new_node('replica');
+
+# Fetch xmin columns from slot's pg_replication_slots row, after waiting for
+# given boolean condition to be true to ensure we've reached a quiescent state
+sub wait_for_xmins
+{
+ my ($node, $slotname, $check_expr) = @_;
+
+ $node->poll_query_until(
+ 'postgres', qq[
+ SELECT $check_expr
+ FROM pg_catalog.pg_replication_slots
+ WHERE slot_name = '$slotname';
+ ]) or die "Timed out waiting for slot xmins to advance";
+
+ my $slotinfo = $node->slot($slotname);
+ return ($slotinfo->{'xmin'}, $slotinfo->{'catalog_xmin'});
+}
+
+sub print_phys_xmin
+{
+ my $slot = $node_master->slot('master_physical');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+sub print_logical_xmin
+{
+ my $slot = $node_replica->slot('standby_logical');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+sub create_logical_slots
+{
+ is($node_replica->create_logical_slot_on_standby($node_master, 'dropslot', 'testdb'),
+ 0, 'created dropslot on testdb')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+ is($node_replica->slot('dropslot')->{'slot_type'}, 'logical', 'dropslot on standby created');
+ is($node_replica->create_logical_slot_on_standby($node_master, 'activeslot', 'testdb'),
+ 0, 'created activeslot on testdb')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+ is($node_replica->slot('activeslot')->{'slot_type'}, 'logical', 'activeslot on standby created');
+
+ return 0;
+}
+
+sub make_slot_active
+{
+ # make sure activeslot is in use
+ print "starting pg_recvlogical";
+ $handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'activeslot', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+
+ while (!$node_replica->slot('activeslot')->{'active_pid'})
+ {
+ usleep(100_000);
+ print "waiting for slot to become active\n";
+ }
+ return 0;
+}
+
+sub check_slots_dropped
+{
+ is($node_replica->slot('dropslot')->{'slot_type'}, '', 'dropslot on standby dropped');
+ is($node_replica->slot('activeslot')->{'slot_type'}, '', 'activeslot on standby dropped');
+
+ # our client should've terminated in response to the walsender error
+ eval {
+ $handle->finish;
+ };
+ $return = $?;
+ cmp_ok($return, "!=", 0, "pg_recvlogical exited non-zero ");
+ if ($return) {
+ like($stderr, qr/conflict with recovery/, 'recvlogical recovery conflict');
+ like($stderr, qr/must be dropped/, 'recvlogical error detail');
+ }
+
+ return 0;
+}
+
+# Initialize master node
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('master_physical');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--slot=master_physical');
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# After slot creation, xmins must be null
+is($xmin, '', "xmin null");
+is($catalog_xmin, '', "catalog_xmin null");
+
+# Initialize slave node
+$node_replica->init_from_backup(
+ $node_master, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_replica->append_conf('postgresql.conf',
+ q[primary_slot_name = 'master_physical']);
+
+$node_replica->start;
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# with hot_standby_feedback off, xmin and catalog_xmin must still be null
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "xmin null after replica join");
+is($catalog_xmin, '', "catalog_xmin null after replica join");
+
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. With hot_standby_feedback on, xmin should advance,
+# but catalog_xmin should still remain NULL since there is no logical slot.
+($xmin, $catalog_xmin) = wait_for_xmins($node_master, 'master_physical',
+ "xmin IS NOT NULL AND catalog_xmin IS NULL");
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+#
+# This must succeed since we know we have a catalog_xmin reservation. We
+# might've already sent hot standby feedback to advance our physical slot's
+# catalog_xmin but not received the corresponding xlog for the catalog xmin
+# advance, in which case we'll create a slot that isn't usable. The calling
+# application can prevent this by creating a temporary slot on the master to
+# lock in its catalog_xmin. For a truly race-free solution we'd need
+# master-to-standby hot_standby_feedback replies.
+#
+# In this case it won't race because there's no concurrent activity on the
+# master.
+#
+is($node_replica->create_logical_slot_on_standby($node_master, 'standby_logical', 'testdb'),
+ 0, 'logical slot creation on standby succeeded')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
+$node_master->safe_psql('testdb', 'DROP TABLE test_table');
+$node_master->safe_psql('testdb', 'VACUUM');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# Should show the inserts even when the table is dropped on master
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($stderr, '', 'stderr is empty');
+is($ret, 0, 'replay from slot succeeded')
+ or BAIL_OUT('cannot continue if slot replay fails');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_1();]);
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+for my $i (0 .. 2000)
+{
+ $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_2();]);
+$node_master->safe_psql('testdb', 'VACUUM');
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+cmp_ok($new_logical_catalog_xmin, "==", $logical_catalog_xmin,
+ "logical slot catalog_xmin hasn't advanced before get_changes");
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb',
+ qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+isnt($stdout, '', 'replayed some rows');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# logical slot catalog_xmin on slave should advance after
+# pg_logical_slot_get_changes
+($new_logical_xmin, $new_logical_catalog_xmin) =
+ wait_for_xmins($node_replica, 'standby_logical',
+ "catalog_xmin::varchar::int > ${logical_catalog_xmin}");
+is($new_logical_xmin, '', "logical xmin null");
+
+# hot standby feedback should advance master's phys catalog_xmin now that the
+# standby's slot doesn't hold it down as far.
+my ($new_physical_xmin, $new_physical_catalog_xmin) =
+ wait_for_xmins($node_master, 'master_physical',
+ "catalog_xmin::varchar::int > ${physical_catalog_xmin}");
+isnt($new_physical_xmin, '', "physical xmin not null");
+cmp_ok($new_physical_catalog_xmin, "<=", $new_logical_catalog_xmin,
+ 'upstream physical slot catalog_xmin not past downstream catalog_xmin with hs_feedback on');
+
+#########################################################
+# Upstream oldestXid retention
+#########################################################
+
+sub test_oldest_xid_retention()
+{
+ # First burn some xids on the master in another DB, so we push the master's
+ # nextXid ahead.
+ foreach my $i (1 .. 100)
+ {
+ $node_master->safe_psql('postgres', 'SELECT txid_current()');
+ }
+
+ # Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
+ # past our needed xmin. The only way we have visibility into that is to force
+ # a checkpoint.
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
+ foreach my $dbname ('template1', 'postgres', 'testdb', 'template0')
+ {
+ $node_master->safe_psql($dbname, 'VACUUM FREEZE');
+ }
+ sleep(1);
+ $node_master->safe_psql('postgres', 'CHECKPOINT');
+ IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
+ or die "pg_controldata failed with $?";
+ my @checkpoint = split('\n', $stdout);
+ my ($oldestXid, $nextXid) = ('', '');
+ foreach my $line (@checkpoint)
+ {
+ if ($line =~ qr/^Latest checkpoint's NextXID:\s+\d+:(\d+)/)
+ {
+ $nextXid = $1;
+ }
+ if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
+ {
+ $oldestXid = $1;
+ }
+ }
+ die 'no oldestXID found in checkpoint' unless $oldestXid;
+
+ my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+ my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+
+ print "upstream oldestXid $oldestXid, nextXid $nextXid, phys slot catalog_xmin $new_physical_catalog_xmin, downstream catalog_xmin $new_logical_catalog_xmin";
+
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
+
+ return ($oldestXid);
+}
+
+my ($oldestXid) = test_oldest_xid_retention();
+
+cmp_ok($oldestXid, "<=", $new_logical_catalog_xmin,
+ 'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
+
+##################################################
+# Drop slot
+##################################################
+#
+is($node_replica->safe_psql('postgres', 'SHOW hot_standby_feedback'), 'on', 'hs_feedback is on');
+
+# Make sure slots on replicas are droppable, and properly clear the upstream's xmin
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
+
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. catalog_xmin should become NULL because we dropped
+# the logical slot.
+($xmin, $catalog_xmin) = wait_for_xmins($node_master, 'master_physical',
+ "xmin IS NOT NULL AND catalog_xmin IS NULL");
+
+##################################################
+# Recovery conflict: Drop conflicting slots, including in-use slots
+# Scenario 1 : hot_standby_feedback off
+##################################################
+
+create_logical_slots();
+
+# One way to reproduce recovery conflict is to run VACUUM FULL with
+# hot_standby_feedback turned off on slave.
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = off
+]);
+$node_replica->restart;
+# ensure walreceiver feedback off by waiting for expected xmin and
+# catalog_xmin on master. Both should be NULL since hs_feedback is off
+($xmin, $catalog_xmin) = wait_for_xmins($node_master, 'master_physical',
+ "xmin IS NULL AND catalog_xmin IS NULL");
+
+make_slot_active();
+
+# This should trigger the conflict
+$node_master->safe_psql('testdb', 'VACUUM FULL');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+check_slots_dropped();
+
+# Turn hot_standby_feedback back on
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. With hot_standby_feedback on, xmin should advance,
+# but catalog_xmin should still remain NULL since there is no logical slot.
+($xmin, $catalog_xmin) = wait_for_xmins($node_master, 'master_physical',
+ "xmin IS NOT NULL AND catalog_xmin IS NULL");
+
+##################################################
+# Recovery conflict: Drop conflicting slots, including in-use slots
+# Scenario 2 : incorrect wal_level at master
+##################################################
+
+create_logical_slots();
+
+make_slot_active();
+
+# Make master wal_level replica. This will trigger slot conflict.
+$node_master->append_conf('postgresql.conf',q[
+wal_level = 'replica'
+]);
+$node_master->restart;
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+check_slots_dropped();
+
+# Restore master wal_level
+$node_master->append_conf('postgresql.conf',q[
+wal_level = 'logical'
+]);
+$node_master->restart;
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+##################################################
+# Recovery: drop database drops slots, including active slots.
+##################################################
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB.
+create_logical_slots();
+
+make_slot_active();
+
+# Create a slot on a database that would not be dropped. This slot should not
+# get dropped.
+is($node_replica->create_logical_slot_on_standby($node_master, 'otherslot', 'postgres'),
+ 0, 'created otherslot on postgres')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres',
+ q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+ 'database dropped on standby');
+
+check_slots_dropped();
+
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical',
+ 'otherslot on standby not dropped');
+
+# Cleanup : manually drop the slot that was not dropped.
+$node_replica->psql('postgres', q[SELECT pg_drop_replication_slot('otherslot')]);
--
2.1.4
On 07/01/2019 11:04 AM, Amit Khandekar wrote:
Also, in the updated patch (v11), I have added some scenarios that
verify that slot is dropped when either master wal_level is
insufficient, or when slot is conflicting. Also organized the test
file a bit.
One scenario where the replication slot is removed even after fixing the
problem (which the error message suggested to do).
Please refer to the scenario below:
Master cluster-
postgresql.conf file
wal_level=logical
hot_standby_feedback = on
port=5432
Standby cluster-
postgresql.conf file
wal_level=logical
hot_standby_feedback = on
port=5433
Both master/slave clusters are up and running and are in sync with each other
Create a logical replication slot on SLAVE ( SELECT * from
pg_create_logical_replication_slot('m', 'test_decoding'); )
change wal_level='hot_standby' on Master postgresql.conf file / restart
the server
Run get_changes function on Standby -
postgres=# select * from pg_logical_slot_get_changes('m',null,null);
ERROR: logical decoding on standby requires wal_level >= logical on master
Correct it in the master's postgresql.conf file, i.e. set wal_level='logical'
again / restart the server
and again fire get_changes function on Standby -
postgres=# select * from pg_logical_slot_get_changes('m',null,null);
ERROR: replication slot "m" does not exist
This looks a little weird, as the slot got dropped/removed internally. I
guess it should become invalid rather than being removed automatically.
Let users delete the slot themselves rather than have it removed
automatically as a surprise.
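In case it helps to replay this, here is a rough TAP-style sketch of the
same scenario using the PostgresNode helpers from the proposed test file.
It is only illustrative: the node names, the backup name, the slot 'm' and
the use of 'replica' in place of 'hot_standby' are made up here, and it
assumes the standby patch is applied.

# A minimal sketch, assuming the patch is applied; names are illustrative.
use strict;
use warnings;
use PostgresNode;

my $master = get_new_node('master');
$master->init(allows_streaming => 1);
$master->append_conf('postgresql.conf', q[wal_level = logical]);
$master->start;
$master->backup('bkp');

my $standby = get_new_node('standby');
$standby->init_from_backup($master, 'bkp', has_streaming => 1);
$standby->append_conf('postgresql.conf', q[hot_standby_feedback = on]);
$standby->start;

# Create the slot on the standby; with the patch this waits for a
# running-xacts record from the master, so e.g. issue a CHECKPOINT there:
#   SELECT * FROM pg_create_logical_replication_slot('m', 'test_decoding');

# Lower the master below logical and restart it.
$master->append_conf('postgresql.conf', q[wal_level = replica]);
$master->restart;

# First attempt: fails with "logical decoding on standby requires
# wal_level >= logical on master"; replaying the parameter change has
# meanwhile already dropped the slot on the standby.
$standby->psql('postgres',
    q[SELECT * FROM pg_logical_slot_get_changes('m', NULL, NULL)]);

# Put wal_level = logical back on the master, restart, and retry: the
# error is now: replication slot "m" does not exist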
--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
On Thu, 4 Jul 2019 at 15:52, tushar <tushar.ahuja@enterprisedb.com> wrote:
On 07/01/2019 11:04 AM, Amit Khandekar wrote:
Also, in the updated patch (v11), I have added some scenarios that
verify that slot is dropped when either master wal_level is
insufficient, or when slot is conflicting. Also organized the test
file a bit.
One scenario where replication slot removed even after fixing the problem (which Error message suggested to do)
Which specific problem are you referring to? Removing a conflicting
slot is itself part of the fix for the conflicting-slot problem.
Please refer this below scenario
Master cluster-
postgresql,conf file
wal_level=logical
hot_standby_feedback = on
port=5432
Standby cluster-
postgresql,conf file
wal_level=logical
hot_standby_feedback = on
port=5433
both Master/Slave cluster are up and running and are in SYNC with each other
Create a logical replication slot on SLAVE ( SELECT * from pg_create_logical_replication_slot('m', 'test_decoding'); )
change wal_level='hot_standby' on Master postgresql.conf file / restart the server
Run get_changes function on Standby -
postgres=# select * from pg_logical_slot_get_changes('m',null,null);
ERROR: logical decoding on standby requires wal_level >= logical on master
Correct it on Master postgresql.conf file ,i.e set wal_level='logical' again / restart the server
and again fire get_changes function on Standby -
postgres=# select * from pg_logical_slot_get_changes('m',null,null);
ERROR: replication slot "m" does not exist
This looks little weird as slot got dropped/removed internally . i guess it should get invalid rather than removed automatically.
Lets user's delete the slot themself rather than automatically removed as a surprise.
It was earlier discussed about what action should be taken when we
find conflicting slots. Out of the options, one was to drop the slot,
and we chose that because that was simple. See this :
/messages/by-id/20181212204154.nsxf3gzqv3gesl32@alap3.anarazel.de
By the way, you are getting the "logical decoding on standby requires
wal_level >= logical on master" error while using the slot, which is
because we reject the command even before checking the existence of
the slot. Actually the slot is already dropped due to master
wal_level. Then when you correct the master wal_level, the command is
not rejected, and proceeds to use the slot and then it is found that
the slot does not exist.
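Incidentally, one way to see that ordering from the outside (just a
sketch; $node_replica and the slot name 'm' come from the scenario above,
not from the patch) is to look at pg_replication_slots on the standby
right after the primary's wal_level change has been replayed, before
calling pg_logical_slot_get_changes() at all:

# Sketch only: shows the slot is already gone at WAL replay time, so the
# later "does not exist" error just reports an already-completed drop.
my $slot_exists = $node_replica->safe_psql('postgres',
    q[SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = 'm')]);
print "slot still present after replaying the wal_level change? $slot_exists\n";
# Expected with the patch: 'f'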
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Thu, 4 Jul 2019 at 17:21, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Thu, 4 Jul 2019 at 15:52, tushar <tushar.ahuja@enterprisedb.com> wrote:
On 07/01/2019 11:04 AM, Amit Khandekar wrote:
Also, in the updated patch (v11), I have added some scenarios that
verify that slot is dropped when either master wal_level is
insufficient, or when slot is conflicting. Also organized the test
file a bit.
One scenario where replication slot removed even after fixing the problem (which Error message suggested to do)
Which specific problem are you referring to ? Removing a conflicting
slot, itself is the part of the fix for the conflicting slot problem.
Please refer this below scenario
Master cluster-
postgresql,conf file
wal_level=logical
hot_standby_feedback = on
port=5432
Standby cluster-
postgresql,conf file
wal_level=logical
hot_standby_feedback = on
port=5433
both Master/Slave cluster are up and running and are in SYNC with each other
Create a logical replication slot on SLAVE ( SELECT * from pg_create_logical_replication_slot('m', 'test_decoding'); )
change wal_level='hot_standby' on Master postgresql.conf file / restart the server
Run get_changes function on Standby -
postgres=# select * from pg_logical_slot_get_changes('m',null,null);
ERROR: logical decoding on standby requires wal_level >= logical on master
Correct it on Master postgresql.conf file ,i.e set wal_level='logical' again / restart the server
and again fire get_changes function on Standby -
postgres=# select * from pg_logical_slot_get_changes('m',null,null);
ERROR: replication slot "m" does not exist
This looks little weird as slot got dropped/removed internally . i guess it should get invalid rather than removed automatically.
Lets user's delete the slot themself rather than automatically removed as a surprise.
It was earlier discussed about what action should be taken when we
find conflicting slots. Out of the options, one was to drop the slot,
and we chose that because that was simple. See this :
/messages/by-id/20181212204154.nsxf3gzqv3gesl32@alap3.anarazel.de
Sorry, the above link is not the one I wanted to refer to. Correct one is this :
/messages/by-id/20181214005521.jaty2d24lz4nroil@alap3.anarazel.de
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Hi,
Thanks for the new version! Looks like we're making progress towards
something committable here.
I think it'd be good to split the patch into a few pieces. I'd maybe do
that like:
1) WAL format changes (plus required other changes)
2) Recovery conflicts with slots
3) logical decoding on standby
4) tests
@@ -589,6 +590,7 @@ gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXi
 */
 /* XLOG stuff */
+ xlrec_reuse.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
Hm. I think we otherwise only ever use
RelationIsAccessibleInLogicalDecoding() on tables, not on indexes. And
while I think this would mostly work for builtin catalog tables, it
won't work for "user catalog tables" as RelationIsUsedAsCatalogTable()
won't perform any useful checks for indexes.
So I think we either need to look up the table, or pass it down.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d768b9b..10b7857 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7149,12 +7149,13 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
 * see comments for vacuum_log_cleanup_info().
 */
XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
+log_heap_cleanup_info(Relation rel, TransactionId latestRemovedXid)
{
 xl_heap_cleanup_info xlrec;
 XLogRecPtr recptr;
- xlrec.node = rnode;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
+ xlrec.node = rel->rd_node;
 xlrec.latestRemovedXid = latestRemovedXid;
XLogBeginInsert();
@@ -7190,6 +7191,7 @@ log_heap_clean(Relation reln, Buffer buffer,
 /* Caller should not call me on a non-WAL-logged relation */
 Assert(RelationNeedsWAL(reln));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
It'd probably be a good idea to add a comment to
RelationIsUsedAsCatalogTable() that it better never invoke anything
performing catalog accesses. Otherwise there's quite the danger of
recursion (some operation doing RelationIsAccessibleInLogicalDecoding(),
which then accesses the catalog, which in turn could again need to
perform said operation, and so on in a loop).
/* Entry in pending-list of TIDs we need to revisit */
@@ -502,6 +503,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
OffsetNumber itemnos[MaxIndexTuplesPerPage];
 spgxlogVacuumRedirect xlrec;
+ xlrec.onCatalogTable = get_rel_logical_catalog(index->rd_index->indrelid);
xlrec.nToPlaceholder = 0;
xlrec.newestRedirectXid = InvalidTransactionId;
We should document that it is safe to do catalog accesses here, because
spgist is never used to back catalogs. Otherwise there would be an
endless recursion danger here.
Did you check how hard it would be to just pass down the heap relation?
+/*
+ * Get the wal_level from the control file.
+ */
+WalLevel
+GetActiveWalLevel(void)
+{
+ return ControlFile->wal_level;
+}
What does "Active" mean here? I assume it's supposed to indicate that it
could be different than what's configured in postgresql.conf, for a
replica? If so, that should be mentioned.
+/*
* Initialization of shared memory for XLOG
*/
Size
@@ -9843,6 +9852,19 @@ xlog_redo(XLogReaderState *record)
/* Update our copy of the parameters in pg_control */
 memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
+ /*
+ * Drop logical slots if we are in hot standby and master does not have
+ * logical data.
nitpick: s/master/the primary/ (mostly adding the "the", but I
personally also prefer primary over master)
s/logical data/a WAL level sufficient for logical decoding/
Don't bother to search for the slots if standby is
+ * running with wal_level lower than logical, because in that case,
+ * we would have either disallowed creation of logical slots or dropped
+ * existing ones.
s/Don't bother/No need/
s/slots/potentially conflicting logically slots/
+ if (InRecovery && InHotStandby &&
+ xlrec.wal_level < WAL_LEVEL_LOGICAL &&
+ wal_level >= WAL_LEVEL_LOGICAL)
+ ResolveRecoveryConflictWithLogicalSlots(InvalidOid, InvalidTransactionId,
+ gettext_noop("Logical decoding on standby requires wal_level >= logical on master."));
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 151c3ef..c1bd028 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 * can restart from there.
 */
 break;
+ case XLOG_PARAMETER_CHANGE:
+ {
+ xl_parameter_change *xlrec =
+ (xl_parameter_change *) XLogRecGetData(buf->record);
+ /* Cannot proceed if master itself does not have logical data */
This needs an explanation as to how this is reachable...
+ if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ break;
Hm, this strikes me as a not quite good enough error message (same in
other copies of the message). Perhaps something roughly like "could not
continue with logical decoding, the primary's wal level is now too low
(%u)"?
 if (RecoveryInProgress())
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("logical decoding cannot be used while in recovery")));
+ {
+ /*
+ * This check may have race conditions, but whenever
+ * XLOG_PARAMETER_CHANGE indicates that wal_level has changed, we
+ * verify that there are no existing logical replication slots. And to
+ * avoid races around creating a new slot,
+ * CheckLogicalDecodingRequirements() is called once before creating
+ * the slot, and once when logical decoding is initially starting up.
+ */
+ if (GetActiveWalLevel() < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ }
 }
/*
@@ -241,6 +240,8 @@ CreateInitDecodingContext(char *plugin,
LogicalDecodingContext *ctx;
 MemoryContext old_context;
+ CheckLogicalDecodingRequirements();
+
This should reference the above explanation.
+/*
+ * Permanently drop a conflicting replication slot. If it's already active by
+ * another backend, send it a recovery conflict signal, and then try again.
+ */
+static void
+ReplicationSlotDropConflicting(ReplicationSlot *slot)
+void
+ResolveRecoveryConflictWithLogicalSlots(Oid dboid, TransactionId xid,
+ char *conflict_reason)
+{
+ /*
+ * Build the conflict_str which will look like :
+ * "Slot conflicted with xid horizon which was being increased
+ * to 9012 (slot xmin: 1234, slot catalog_xmin: 5678)."
+ */
+ initStringInfo(&conflict_xmins);
+ if (TransactionIdIsValid(slot_xmin) &&
+ TransactionIdPrecedesOrEquals(slot_xmin, xid))
+ {
+ appendStringInfo(&conflict_xmins, "slot xmin: %d", slot_xmin);
+ }
+ if (TransactionIdIsValid(slot_catalog_xmin) &&
+ TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
+ appendStringInfo(&conflict_xmins, "%sslot catalog_xmin: %d",
+ conflict_xmins.len > 0 ? ", " : "",
+ slot_catalog_xmin);
+
+ if (conflict_xmins.len > 0)
+ {
+ initStringInfo(&conflict_str);
+ appendStringInfo(&conflict_str, "%s %d (%s).",
+ conflict_sentence, xid, conflict_xmins.data);
+ found_conflict = true;
+ conflict_reason = conflict_str.data;
+ }
+ }
I think this is going to be a nightmare for translators, no? I'm not
clear as to why any of this is needed?
+ /* ReplicationSlotDropPtr() would acquire the lock below */
+ LWLockRelease(ReplicationSlotControlLock);
"would acquire"? I think it *does* acquire, right?
@@ -2879,6 +2882,25 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 case PROCSIG_RECOVERY_CONFLICT_LOCK:
 case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
 case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+ case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
+ /*
+ * For conflicts that require a logical slot to be dropped, the
+ * requirement is for the signal receiver to release the slot,
+ * so that it could be dropped by the signal sender. So for
+ * normal backends, the transaction should be aborted, just
+ * like for other recovery conflicts. But if it's walsender on
+ * standby, then it has to be killed so as to release an
+ * acquired logical slot.
+ */
+ if (am_cascading_walsender &&
+ reason == PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT &&
+ MyReplicationSlot && SlotIsLogical(MyReplicationSlot))
+ {
+ RecoveryConflictPending = true;
+ QueryCancelPending = true;
+ InterruptPending = true;
+ break;
+ }
Huh, I'm not following as to why that's needed for walsenders?
@@ -1499,6 +1499,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
 dbentry->n_conflict_tablespace +
 dbentry->n_conflict_lock +
 dbentry->n_conflict_snapshot +
+ dbentry->n_conflict_logicalslot +
 dbentry->n_conflict_bufferpin +
 dbentry->n_conflict_startup_deadlock);
I think this probably needs adjustments in a few more places,
e.g. monitoring.sgml...
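Once monitoring.sgml and the view definitions are adjusted, something
along these lines should show the new counter ticking after a slot
conflict (a sketch only; it assumes the patch's confl_logicalslot column
in pg_stat_database_conflicts and a standby node handle named
$node_replica from the TAP test):

# Sketch: read the per-database recovery conflict counters on the standby
# after provoking a slot conflict; confl_logicalslot is the new column.
my $conflicts = $node_replica->safe_psql('postgres',
    q[SELECT datname, confl_snapshot, confl_logicalslot
      FROM pg_stat_database_conflicts WHERE datname = 'testdb']);
print "conflict counters: $conflicts\n";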
Thanks!
Andres Freund
On Tue, Jul 9, 2019 at 11:14 PM Andres Freund <andres@anarazel.de> wrote:
+ if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ break;
Hm, this strikes me as a not quite good enough error message (same in
other copies of the message). Perhaps something roughly like "could not
continue with logical decoding, the primary's wal level is now too low
(%u)"?
For what it's worth, I dislike that wording on grammatical grounds --
it sounds like two complete sentences joined by a comma, which is poor
style -- and think Amit's wording is probably fine. We could fix the
grammatical issue by replacing the comma in your version with the word
"because," but that seems unnecessarily wordy to me.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, 10 Jul 2019 at 08:44, Andres Freund <andres@anarazel.de> wrote:
Hi,
Thanks for the new version! Looks like we're making progress towards
something committable here.
I think it'd be good to split the patch into a few pieces. I'd maybe do
that like:
1) WAL format changes (plus required other changes)
2) Recovery conflicts with slots
3) logical decoding on standby
4) tests
All right. Will do that in the next patch set. For now, I have quickly
done the below changes in a single patch again (attached), in order to
get early comments if any.
@@ -589,6 +590,7 @@ gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXi
 */
 /* XLOG stuff */
+ xlrec_reuse.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
 xlrec_reuse.latestRemovedXid = latestRemovedXid;
Hm. I think we otherwise only ever use
RelationIsAccessibleInLogicalDecoding() on tables, not on indexes. And
while I think this would mostly work for builtin catalog tables, it
won't work for "user catalog tables" as RelationIsUsedAsCatalogTable()
won't perform any useful checks for indexes.
So I think we either need to look up the table, or pass it down.
Done. Passed down the heap rel.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d768b9b..10b7857 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7149,12 +7149,13 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
 * see comments for vacuum_log_cleanup_info().
 */
XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
+log_heap_cleanup_info(Relation rel, TransactionId latestRemovedXid)
{
 xl_heap_cleanup_info xlrec;
 XLogRecPtr recptr;
- xlrec.node = rnode;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
+ xlrec.node = rel->rd_node;
 xlrec.latestRemovedXid = latestRemovedXid;
XLogBeginInsert();
@@ -7190,6 +7191,7 @@ log_heap_clean(Relation reln, Buffer buffer,
 /* Caller should not call me on a non-WAL-logged relation */
 Assert(RelationNeedsWAL(reln));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
It'd probably be a good idea to add a comment to
RelationIsUsedAsCatalogTable() that it better never invoke anything
performing catalog accesses. Otherwise there's quite the danger with
recursion (some operation doing RelationIsAccessibleInLogicalDecoding(),
that then accessing the catalog, which in turn could again need to
perform said operation, loop).
Added comments in RelationIsUsedAsCatalogTable() as well as
RelationIsAccessibleInLogicalDecoding() :
* RelationIsAccessibleInLogicalDecoding
* True if we need to log enough information to have access via
* decoding snapshot.
* This definition should not invoke anything that performs catalog
* access. Otherwise, e.g. logging a WAL entry for catalog relation may
* invoke this function, which will in turn do catalog access, which may
* in turn cause another similar WAL entry to be logged, leading to
* infinite recursion.
/* Entry in pending-list of TIDs we need to revisit */
@@ -502,6 +503,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
OffsetNumber itemnos[MaxIndexTuplesPerPage];
 spgxlogVacuumRedirect xlrec;
+ xlrec.onCatalogTable = get_rel_logical_catalog(index->rd_index->indrelid);
 xlrec.nToPlaceholder = 0;
 xlrec.newestRedirectXid = InvalidTransactionId;
We should document that it is safe to do catalog accesses here, because
spgist is never used to back catalogs. Otherwise there would be an
endless recursion danger here.
Comments added.
Did you check how hard it would be to just pass down the heap relation?
It does look hard. Check my comments in an earlier reply, that I have
pasted below :
This one seems harder, but I'm not actually sure why we make it so
hard. It seems like we just ought to add the table to IndexVacuumInfo.
This means we have to add heapRel assignment wherever we initialize
IndexVacuumInfo structure, namely in lazy_vacuum_index(),
lazy_cleanup_index(), validate_index(), analyze_rel(), and make sure
these functions have a heap rel handle. Do you think we should do this
as part of this patch ?
+/*
+ * Get the wal_level from the control file.
+ */
+WalLevel
+GetActiveWalLevel(void)
+{
+ return ControlFile->wal_level;
+}
What does "Active" mean here? I assume it's supposed to indicate that it
could be different than what's configured in postgresql.conf, for a
replica? If so, that should be mentioned.
Done. Here are the new comments :
* Get the wal_level from the control file. For a standby, this value should be
* considered as its active wal_level, because it may be different from what
* was originally configured on standby.
+/*
* Initialization of shared memory for XLOG
*/
Size
@@ -9843,6 +9852,19 @@ xlog_redo(XLogReaderState *record)
/* Update our copy of the parameters in pg_control */
 memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
+ /*
+ * Drop logical slots if we are in hot standby and master does not have
+ * logical data.
nitpick: s/master/the primary/ (mostly adding the "the", but I
personally also prefer primary over master)
s/logical data/a WAL level sufficient for logical decoding/
Don't bother to search for the slots if standby is
+ * running with wal_level lower than logical, because in that case,
+ * we would have either disallowed creation of logical slots or dropped
+ * existing ones.
s/Don't bother/No need/
s/slots/potentially conflicting logically slots/
Done.
+ if (InRecovery && InHotStandby &&
+ xlrec.wal_level < WAL_LEVEL_LOGICAL &&
+ wal_level >= WAL_LEVEL_LOGICAL)
+ ResolveRecoveryConflictWithLogicalSlots(InvalidOid, InvalidTransactionId,
+ gettext_noop("Logical decoding on standby requires wal_level >= logical on master."));
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 151c3ef..c1bd028 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 * can restart from there.
 */
 break;
+ case XLOG_PARAMETER_CHANGE:
+ {
+ xl_parameter_change *xlrec =
+ (xl_parameter_change *) XLogRecGetData(buf->record);
+ /* Cannot proceed if master itself does not have logical data */
This needs an explanation as to how this is reachable...
Done. Here are the comments :
* If wal_level on primary is reduced to less than logical, then we
* want to prevent existing logical slots from being used.
* Existing logical slot on standby gets dropped when this WAL
* record is replayed; and further, slot creation fails when the
* wal level is not sufficient; but all these operations are not
* synchronized, so a logical slot may creep in while the wal_level
* is being reduced. Hence this extra check.
+ if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ break;
Hm, this strikes me as a not quite good enough error message (same in
other copies of the message). Perhaps something roughly like "could not
continue with logical decoding, the primary's wal level is now too low
(%u)"?
Haven't changed this. There is another reply from Robert. I think what
you want to emphasize is that we can't *continue*. I am not sure why
the user can't infer that "logical decoding could not continue" when
we say "logical decoding requires wal_level >= ....".
 if (RecoveryInProgress())
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("logical decoding cannot be used while in recovery")));
+ {
+ /*
+ * This check may have race conditions, but whenever
+ * XLOG_PARAMETER_CHANGE indicates that wal_level has changed, we
+ * verify that there are no existing logical replication slots. And to
+ * avoid races around creating a new slot,
+ * CheckLogicalDecodingRequirements() is called once before creating
+ * the slot, and once when logical decoding is initially starting up.
+ */
+ if (GetActiveWalLevel() < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ }
 }
/*
@@ -241,6 +240,8 @@ CreateInitDecodingContext(char *plugin,
 LogicalDecodingContext *ctx;
 MemoryContext old_context;
+ CheckLogicalDecodingRequirements();
+
This should reference the above explanation.
Done.
+/*
+ * Permanently drop a conflicting replication slot. If it's already active by
+ * another backend, send it a recovery conflict signal, and then try again.
+ */
+static void
+ReplicationSlotDropConflicting(ReplicationSlot *slot)
+void
+ResolveRecoveryConflictWithLogicalSlots(Oid dboid, TransactionId xid,
+ char *conflict_reason)
+{
+ /*
+ * Build the conflict_str which will look like :
+ * "Slot conflicted with xid horizon which was being increased
+ * to 9012 (slot xmin: 1234, slot catalog_xmin: 5678)."
+ */
+ initStringInfo(&conflict_xmins);
+ if (TransactionIdIsValid(slot_xmin) &&
+ TransactionIdPrecedesOrEquals(slot_xmin, xid))
+ {
+ appendStringInfo(&conflict_xmins, "slot xmin: %d", slot_xmin);
+ }
+ if (TransactionIdIsValid(slot_catalog_xmin) &&
+ TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
+ appendStringInfo(&conflict_xmins, "%sslot catalog_xmin: %d",
+ conflict_xmins.len > 0 ? ", " : "",
+ slot_catalog_xmin);
+
+ if (conflict_xmins.len > 0)
+ {
+ initStringInfo(&conflict_str);
+ appendStringInfo(&conflict_str, "%s %d (%s).",
+ conflict_sentence, xid, conflict_xmins.data);
+ found_conflict = true;
+ conflict_reason = conflict_str.data;
+ }
+ }
I think this is going to be a nightmare for translators, no?
For translators, I think the .po files will have the required text,
because I have used gettext_noop() for both conflict_sentence and the
passed in conflict_reason parameter. And the "dropped conflicting
slot." is passed to ereport() as usual. The rest portion of errdetail
is not language specific. E.g. "slot" remains "slot".
I'm not clear as to why any of this is needed?
The conflict can happen for either xmin or catalog_xmin or both, right?
The purpose of the above is to show only the conflicting one(s) of the
two.
+ /* ReplicationSlotDropPtr() would acquire the lock below */
+ LWLockRelease(ReplicationSlotControlLock);
"would acquire"? I think it *does* acquire, right?
Yes, Changed to "will".
@@ -2879,6 +2882,25 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 case PROCSIG_RECOVERY_CONFLICT_LOCK:
 case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
 case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+ case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
+ /*
+ * For conflicts that require a logical slot to be dropped, the
+ * requirement is for the signal receiver to release the slot,
+ * so that it could be dropped by the signal sender. So for
+ * normal backends, the transaction should be aborted, just
+ * like for other recovery conflicts. But if it's walsender on
+ * standby, then it has to be killed so as to release an
+ * acquired logical slot.
+ */
+ if (am_cascading_walsender &&
+ reason == PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT &&
+ MyReplicationSlot && SlotIsLogical(MyReplicationSlot))
+ {
+ RecoveryConflictPending = true;
+ QueryCancelPending = true;
+ InterruptPending = true;
+ break;
+ }
Huh, I'm not following as to why that's needed for walsenders?
For normal backends, we ignore this signal if we aren't in a
transaction (block). But for a walsender there is no transaction, and
yet we cannot ignore the signal, because the walsender can keep a
logical slot acquired when it was spawned by "pg_recvlogical --start".
So the only way we can make it release the acquired slot is to kill it.
@@ -1499,6 +1499,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
 dbentry->n_conflict_tablespace +
 dbentry->n_conflict_lock +
 dbentry->n_conflict_snapshot +
+ dbentry->n_conflict_logicalslot +
 dbentry->n_conflict_bufferpin +
 dbentry->n_conflict_startup_deadlock);
I think this probably needs adjustments in a few more places,
e.g. monitoring.sgml...
Oops, yeah, to search for similar additions, I had looked for
"conflict_snapshot" using cscope. I should have done the same using
"git grep".
Done now.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
logical-decoding-on-standby_v12.patch (application/octet-stream)
From 3dbe81d332f7145fd356957f9b4609e8d2e97b24 Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Wed, 10 Jul 2019 16:55:19 +0530
Subject: [PATCH] Logical decoding on standby - v12
Author : Andres Freund.
Besides the above main changes, patch includes following :
1. Handle slot conflict recovery by dropping the conflicting slots.
-Amit Khandekar.
2. test/recovery/t/016_logical_decoding_on_replica.pl added.
Original author : Craig Ringer. few changes/additions from Amit Khandekar.
3. Handle slot conflicts when master wal_level becomes less than logical.
Changes in v6 patch :
While creating the slot, lastReplayedEndRecPtr is used to set the
restart_lsn, but its position is later adjusted in
DecodingContextFindStartpoint() in case it does not point to a
valid record location. This can happen because replay pointer
points to 1 + end of last record replayed, which means it can
coincide with first byte of a new WAL block, i.e. inside block
header.
Also, modified the test to handle the requirement that the
logical slot creation on standby requires a checkpoint
(or any other transaction commit) to be given from master. For
that, in src/test/perl/PostgresNode.pm, added a new function
create_logical_slot_on_standby() which does the required steps.
Changes in v7 patch :
Merge the two conflict messages for xmin and catalog_xmin into
a single one.
Changes in v8 :
Fix incorrect flush ptr on standby (reported by Tushar Ahuja).
In XLogSendLogical(), GetFlushRecPtr() was used to get the flushed
point. On standby, GetFlushRecPtr() does not give a valid value, so it
was wrongly determined that the sent record is beyond flush point, as
a result of which, WalSndCaughtUp was set to true, causing
WalSndLoop() to sleep for some duration after every record.
This was reported by Tushar Ahuja, where pg_recvlogical seems like it
is hanging when there are loads of insert.
Fix: Use GetStandbyFlushRecPtr() if am_cascading_walsender
Changes in v9 :
While dropping a conflicting logical slot, if a backend has acquired it, send
it a conflict recovery signal. Check new function ReplicationSlotDropConflicting().
Also, miscellaneous review comments addressed, but not all of them yet.
Changes in v10 :
Adjust restart_lsn if it's a Replay Pointer.
This was earlier done in DecodingContextFindStartpoint() but now it
is done in in ReplicationSlotReserveWal(), when restart_lsn is initialized.
Changes in v11 :
Added some test scenarios to test drop-slot conflicts. Organized the
test file a bit.
Also improved the conflict error message.
Changes in v12 :
Review comments addressed.
---
doc/src/sgml/monitoring.sgml | 6 +
src/backend/access/gist/gist.c | 2 +-
src/backend/access/gist/gistbuild.c | 2 +-
src/backend/access/gist/gistutil.c | 4 +-
src/backend/access/gist/gistxlog.c | 9 +-
src/backend/access/hash/hash_xlog.c | 3 +-
src/backend/access/hash/hashinsert.c | 2 +
src/backend/access/heap/heapam.c | 23 +-
src/backend/access/heap/vacuumlazy.c | 2 +-
src/backend/access/heap/visibilitymap.c | 2 +-
src/backend/access/nbtree/nbtpage.c | 4 +
src/backend/access/nbtree/nbtxlog.c | 4 +-
src/backend/access/spgist/spgvacuum.c | 8 +
src/backend/access/spgist/spgxlog.c | 1 +
src/backend/access/transam/xlog.c | 25 ++
src/backend/catalog/system_views.sql | 1 +
src/backend/postmaster/pgstat.c | 4 +
src/backend/replication/logical/decode.c | 22 +-
src/backend/replication/logical/logical.c | 37 +-
src/backend/replication/slot.c | 233 +++++++++++-
src/backend/replication/walsender.c | 8 +-
src/backend/storage/ipc/procarray.c | 4 +
src/backend/storage/ipc/procsignal.c | 3 +
src/backend/storage/ipc/standby.c | 7 +-
src/backend/tcop/postgres.c | 23 +-
src/backend/utils/adt/pgstatfuncs.c | 16 +
src/backend/utils/cache/lsyscache.c | 16 +
src/include/access/gist_private.h | 6 +-
src/include/access/gistxlog.h | 3 +-
src/include/access/hash_xlog.h | 1 +
src/include/access/heapam_xlog.h | 8 +-
src/include/access/nbtxlog.h | 2 +
src/include/access/spgxlog.h | 1 +
src/include/access/xlog.h | 1 +
src/include/catalog/pg_proc.dat | 5 +
src/include/pgstat.h | 1 +
src/include/replication/slot.h | 2 +
src/include/storage/procsignal.h | 1 +
src/include/storage/standby.h | 2 +-
src/include/utils/lsyscache.h | 1 +
src/include/utils/rel.h | 9 +
src/test/perl/PostgresNode.pm | 27 ++
.../recovery/t/018_logical_decoding_on_replica.pl | 420 +++++++++++++++++++++
src/test/regress/expected/rules.out | 1 +
44 files changed, 896 insertions(+), 66 deletions(-)
create mode 100644 src/test/recovery/t/018_logical_decoding_on_replica.pl
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index bf72d0c..42bfe82 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2678,6 +2678,12 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
old snapshots</entry>
</row>
<row>
+ <entry><structfield>confl_logicalslot</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of queries in this database that have been canceled due to
+ logical slots</entry>
+ </row>
+ <row>
<entry><structfield>confl_bufferpin</structfield></entry>
<entry><type>bigint</type></entry>
<entry>Number of queries in this database that have been canceled due to
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 470b121..af1bd13 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -339,7 +339,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
for (; ptr; ptr = ptr->next)
{
/* Allocate new page */
- ptr->buffer = gistNewBuffer(rel);
+ ptr->buffer = gistNewBuffer(heapRel, rel);
GISTInitBuffer(ptr->buffer, (is_leaf) ? F_LEAF : 0);
ptr->page = BufferGetPage(ptr->buffer);
ptr->block.blkno = BufferGetBlockNumber(ptr->buffer);
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index ecef0ff..b5f59a1 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -171,7 +171,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
buildstate.giststate->tempCxt = createTempGistContext();
/* initialize the root page */
- buffer = gistNewBuffer(index);
+ buffer = gistNewBuffer(heap, index);
Assert(BufferGetBlockNumber(buffer) == GIST_ROOT_BLKNO);
page = BufferGetPage(buffer);
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 49df056..1fcc7cb 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -807,7 +807,7 @@ gistcheckpage(Relation rel, Buffer buf)
* Caller is responsible for initializing the page by calling GISTInitBuffer
*/
Buffer
-gistNewBuffer(Relation r)
+gistNewBuffer(Relation heapRel, Relation r)
{
Buffer buffer;
bool needLock;
@@ -851,7 +851,7 @@ gistNewBuffer(Relation r)
* page's deleteXid.
*/
if (XLogStandbyInfoActive() && RelationNeedsWAL(r))
- gistXLogPageReuse(r, blkno, GistPageGetDeleteXid(page));
+ gistXLogPageReuse(heapRel, r, blkno, GistPageGetDeleteXid(page));
return buffer;
}
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 503db34..1f40f98 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -195,7 +195,8 @@ gistRedoDeleteRecord(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
@@ -397,7 +398,7 @@ gistRedoPageReuse(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
- xlrec->node);
+ xlrec->onCatalogTable, xlrec->node);
}
}
@@ -578,7 +579,8 @@ gistXLogPageDelete(Buffer buffer, TransactionId xid,
* Write XLOG record about reuse of a deleted page.
*/
void
-gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXid)
+gistXLogPageReuse(Relation heapRel, Relation rel,
+ BlockNumber blkno, TransactionId latestRemovedXid)
{
gistxlogPageReuse xlrec_reuse;
@@ -589,6 +591,7 @@ gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXi
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = RelationIsAccessibleInLogicalDecoding(heapRel);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index d7b7098..00c3e0f 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -1002,7 +1002,8 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
RelFileNode rnode;
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid,
+ xldata->onCatalogTable, rnode);
}
action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer);
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 5321762..e28465a 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
#include "access/hash.h"
#include "access/hash_xlog.h"
+#include "catalog/catalog.h"
#include "miscadmin.h"
#include "utils/rel.h"
#include "storage/lwlock.h"
@@ -398,6 +399,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
xl_hash_vacuum_one_page xlrec;
XLogRecPtr recptr;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(hrel);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.ntuples = ndeletable;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d768b9b..10b7857 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7149,12 +7149,13 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
* see comments for vacuum_log_cleanup_info().
*/
XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
+log_heap_cleanup_info(Relation rel, TransactionId latestRemovedXid)
{
xl_heap_cleanup_info xlrec;
XLogRecPtr recptr;
- xlrec.node = rnode;
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
+ xlrec.node = rel->rd_node;
xlrec.latestRemovedXid = latestRemovedXid;
XLogBeginInsert();
@@ -7190,6 +7191,7 @@ log_heap_clean(Relation reln, Buffer buffer,
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
xlrec.ndead = ndead;
@@ -7240,6 +7242,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
xlrec.cutoff_xid = cutoff_xid;
xlrec.ntuples = ntuples;
@@ -7270,7 +7273,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
* heap_buffer, if necessary.
*/
XLogRecPtr
-log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
+log_heap_visible(Relation rel, Buffer heap_buffer, Buffer vm_buffer,
TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
@@ -7280,6 +7283,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(heap_buffer));
Assert(BufferIsValid(vm_buffer));
+ xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
xlrec.cutoff_xid = cutoff_xid;
xlrec.flags = vmflags;
XLogBeginInsert();
@@ -7700,7 +7704,8 @@ heap_xlog_cleanup_info(XLogReaderState *record)
xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, xlrec->node);
/*
* Actual operation is a no-op. Record type exists to provide a means for
@@ -7736,7 +7741,8 @@ heap_xlog_clean(XLogReaderState *record)
* latestRemovedXid is invalid, skip conflict processing.
*/
if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
/*
* If we have a full-page image, restore it (using a cleanup lock) and
@@ -7832,7 +7838,9 @@ heap_xlog_visible(XLogReaderState *record)
* rather than killing the transaction outright.
*/
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid,
+ xlrec->onCatalogTable,
+ rnode);
/*
* Read the heap page, if it still exists. If the heap file has dropped or
@@ -7969,7 +7977,8 @@ heap_xlog_freeze_page(XLogReaderState *record)
TransactionIdRetreat(latestRemovedXid);
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a3c4a1d..bf34d3a 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -473,7 +473,7 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
* No need to write the record at all unless it contains a valid value
*/
if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
- (void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+ (void) log_heap_cleanup_info(rel, vacrelstats->latestRemovedXid);
}
/*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06..c5fdd64 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -281,7 +281,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
if (XLogRecPtrIsInvalid(recptr))
{
Assert(!InRecovery);
- recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
+ recptr = log_heap_visible(rel, heapBuf, vmBuf,
cutoff_xid, flags);
/*
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 50455db..65c0f50 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -31,6 +31,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/lsyscache.h"
#include "utils/snapmgr.h"
static void _bt_cachemetadata(Relation rel, BTMetaPageData *input);
@@ -771,6 +772,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
*/
/* XLOG stuff */
+ xlrec_reuse.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
xlrec_reuse.latestRemovedXid = latestRemovedXid;
@@ -1138,6 +1140,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
XLogRecPtr recptr;
xl_btree_delete xlrec_delete;
+ xlrec_delete.onCatalogTable =
+ RelationIsAccessibleInLogicalDecoding(heapRel);
xlrec_delete.latestRemovedXid = latestRemovedXid;
xlrec_delete.nitems = nitems;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 3147ea4..869dfda 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -526,7 +526,8 @@ btree_xlog_delete(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable, rnode);
}
/*
@@ -810,6 +811,7 @@ btree_xlog_reuse_page(XLogReaderState *record)
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+ xlrec->onCatalogTable,
xlrec->node);
}
}
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 2b1662a..28dee96 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -27,6 +27,7 @@
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
+#include "utils/lsyscache.h"
/* Entry in pending-list of TIDs we need to revisit */
@@ -502,6 +503,13 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
OffsetNumber itemnos[MaxIndexTuplesPerPage];
spgxlogVacuumRedirect xlrec;
+ /*
+ * There is no chance of endless recursion here even though we are doing
+ * catalog accesses, because spgist is never used for catalogs. See the
+ * comments in RelationIsAccessibleInLogicalDecoding().
+ */
+ xlrec.onCatalogTable = get_rel_logical_catalog(index->rd_index->indrelid);
+
xlrec.nToPlaceholder = 0;
xlrec.newestRedirectXid = InvalidTransactionId;
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index ebe6ae8..800609c 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -881,6 +881,7 @@ spgRedoVacuumRedirect(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &node, NULL, NULL);
ResolveRecoveryConflictWithSnapshot(xldata->newestRedirectXid,
+ xldata->onCatalogTable,
node);
}
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b6c9353..2f60967 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4927,6 +4927,17 @@ LocalProcessControlFile(bool reset)
}
/*
+ * Get the wal_level from the control file. For a standby, this value should
+ * be considered its active wal_level, because it may differ from what was
+ * originally configured on the standby.
+ */
+WalLevel
+GetActiveWalLevel(void)
+{
+ return ControlFile->wal_level;
+}
+
+/*
* Initialization of shared memory for XLOG
*/
Size
@@ -9856,6 +9867,20 @@ xlog_redo(XLogReaderState *record)
/* Update our copy of the parameters in pg_control */
memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
+ /*
+ * Drop logical slots if we are in hot standby and the primary does not
+ * have a WAL level sufficient for logical decoding. No need to search
+ * for potentially conflicting logical slots if the standby is running
+ * with wal_level lower than logical, because in that case, we would
+ * have either disallowed creation of logical slots or dropped existing
+ * ones.
+ */
+ if (InRecovery && InHotStandby &&
+ xlrec.wal_level < WAL_LEVEL_LOGICAL &&
+ wal_level >= WAL_LEVEL_LOGICAL)
+ ResolveRecoveryConflictWithLogicalSlots(InvalidOid, InvalidTransactionId,
+ gettext_noop("Logical decoding on standby requires wal_level >= logical on master."));
+
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->MaxConnections = xlrec.MaxConnections;
ControlFile->max_worker_processes = xlrec.max_worker_processes;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ea4c85e..f3fad98 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -893,6 +893,7 @@ CREATE VIEW pg_stat_database_conflicts AS
pg_stat_get_db_conflict_tablespace(D.oid) AS confl_tablespace,
pg_stat_get_db_conflict_lock(D.oid) AS confl_lock,
pg_stat_get_db_conflict_snapshot(D.oid) AS confl_snapshot,
+ pg_stat_get_db_conflict_logicalslot(D.oid) AS confl_logicalslot,
pg_stat_get_db_conflict_bufferpin(D.oid) AS confl_bufferpin,
pg_stat_get_db_conflict_startup_deadlock(D.oid) AS confl_deadlock
FROM pg_database D;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b4f2b28..797ea0c 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4728,6 +4728,7 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
dbentry->n_conflict_tablespace = 0;
dbentry->n_conflict_lock = 0;
dbentry->n_conflict_snapshot = 0;
+ dbentry->n_conflict_logicalslot = 0;
dbentry->n_conflict_bufferpin = 0;
dbentry->n_conflict_startup_deadlock = 0;
dbentry->n_temp_files = 0;
@@ -6352,6 +6353,9 @@ pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
dbentry->n_conflict_snapshot++;
break;
+ case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
+ dbentry->n_conflict_logicalslot++;
+ break;
case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
dbentry->n_conflict_bufferpin++;
break;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 151c3ef..abfa8e4 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,11 +190,31 @@ DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
* can restart from there.
*/
break;
+ case XLOG_PARAMETER_CHANGE:
+ {
+ xl_parameter_change *xlrec =
+ (xl_parameter_change *) XLogRecGetData(buf->record);
+
+ /*
+ * If wal_level on primary is reduced to less than logical, then we
+ * want to prevent existing logical slots from being used.
+ * Existing logical slots on standby get dropped when this WAL
+ * record is replayed; and further, slot creation fails when the
+ * wal level is not sufficient; but all these operations are not
+ * synchronized, so a logical slot may creep in while the wal_level
+ * is being reduced. Hence this extra check.
+ */
+ if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ break;
+ }
case XLOG_NOOP:
case XLOG_NEXTOID:
case XLOG_SWITCH:
case XLOG_BACKUP_END:
- case XLOG_PARAMETER_CHANGE:
case XLOG_RESTORE_POINT:
case XLOG_FPW_CHANGE:
case XLOG_FPI_FOR_HINT:
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 9853be6..54d0424 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -94,23 +94,22 @@ CheckLogicalDecodingRequirements(void)
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires a database connection")));
- /* ----
- * TODO: We got to change that someday soon...
- *
- * There's basically three things missing to allow this:
- * 1) We need to be able to correctly and quickly identify the timeline a
- * LSN belongs to
- * 2) We need to force hot_standby_feedback to be enabled at all times so
- * the primary cannot remove rows we need.
- * 3) support dropping replication slots referring to a database, in
- * dbase_redo. There can't be any active ones due to HS recovery
- * conflicts, so that should be relatively easy.
- * ----
- */
if (RecoveryInProgress())
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("logical decoding cannot be used while in recovery")));
+ {
+ /*
+ * This check may have race conditions, but whenever
+ * XLOG_PARAMETER_CHANGE indicates that wal_level has changed, we
+ * verify that there are no existing logical replication slots. And to
+ * avoid races around creating a new slot,
+ * CheckLogicalDecodingRequirements() is called once before creating
+ * the slot, and once when logical decoding is initially starting up.
+ */
+ if (GetActiveWalLevel() < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));
+ }
}
/*
@@ -241,6 +240,12 @@ CreateInitDecodingContext(char *plugin,
LogicalDecodingContext *ctx;
MemoryContext old_context;
+ /*
+ * On a standby, this check is also required while creating the slot. See
+ * the comments in CheckLogicalDecodingRequirements().
+ */
+ CheckLogicalDecodingRequirements();
+
/* shorter lines... */
slot = MyReplicationSlot;
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 62342a6..76d7277 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -46,6 +46,7 @@
#include "pgstat.h"
#include "replication/slot.h"
#include "storage/fd.h"
+#include "storage/lock.h"
#include "storage/proc.h"
#include "storage/procarray.h"
#include "utils/builtins.h"
@@ -101,6 +102,7 @@ int max_replication_slots = 0; /* the maximum number of replication
static void ReplicationSlotDropAcquired(void);
static void ReplicationSlotDropPtr(ReplicationSlot *slot);
+static void ReplicationSlotDropConflicting(ReplicationSlot *slot);
/* internal persistency functions */
static void RestoreSlotFromDisk(const char *name);
@@ -638,6 +640,64 @@ ReplicationSlotDropPtr(ReplicationSlot *slot)
}
/*
+ * Permanently drop a conflicting replication slot. If it's already active by
+ * another backend, send it a recovery conflict signal, and then try again.
+ */
+static void
+ReplicationSlotDropConflicting(ReplicationSlot *slot)
+{
+ pid_t active_pid;
+ PGPROC *proc;
+ VirtualTransactionId vxid;
+
+ ConditionVariablePrepareToSleep(&slot->active_cv);
+ while (1)
+ {
+ SpinLockAcquire(&slot->mutex);
+ active_pid = slot->active_pid;
+ if (active_pid == 0)
+ active_pid = slot->active_pid = MyProcPid;
+ SpinLockRelease(&slot->mutex);
+
+ /* Drop the acquired slot, unless it is acquired by another backend */
+ if (active_pid == MyProcPid)
+ {
+ elog(DEBUG1, "acquired conflicting slot, now dropping it");
+ ReplicationSlotDropPtr(slot);
+ break;
+ }
+
+ /* Send the other backend a recovery conflict signal */
+
+ SetInvalidVirtualTransactionId(vxid);
+ LWLockAcquire(ProcArrayLock, LW_SHARED);
+ proc = BackendPidGetProcWithLock(active_pid);
+ if (proc)
+ GET_VXID_FROM_PGPROC(vxid, *proc);
+ LWLockRelease(ProcArrayLock);
+
+ /*
+ * If that process coincidentally finished, some other backend may
+ * acquire the slot again. So start over.
+ * Note: Even if vxid.localTransactionId is invalid, we need to cancel
+ * that backend, because there is no other way to make it release the
+ * slot. So don't bother to validate vxid.localTransactionId.
+ */
+ if (vxid.backendId == InvalidBackendId)
+ continue;
+
+ elog(DEBUG1, "cancelling pid %d (backendId: %d) for releasing slot",
+ active_pid, vxid.backendId);
+
+ CancelVirtualTransaction(vxid, PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT);
+ ConditionVariableSleep(&slot->active_cv,
+ WAIT_EVENT_REPLICATION_SLOT_DROP);
+ }
+
+ ConditionVariableCancelSleep();
+}
+
+/*
* Serialize the currently acquired slot's state from memory to disk, thereby
* guaranteeing the current state will survive a crash.
*/
@@ -1016,37 +1076,56 @@ ReplicationSlotReserveWal(void)
/*
* For logical slots log a standby snapshot and start logical decoding
* at exactly that position. That allows the slot to start up more
- * quickly.
+ * quickly. But on a standby we cannot do WAL writes, so just use the
+ * replay pointer; effectively, an attempt to create a logical slot on
+ * standby will cause it to wait for an xl_running_xacts record to be
+ * logged independently on the primary, so that a snapshot can be built
+ * using the record.
*
- * That's not needed (or indeed helpful) for physical slots as they'll
- * start replay at the last logged checkpoint anyway. Instead return
- * the location of the last redo LSN. While that slightly increases
- * the chance that we have to retry, it's where a base backup has to
- * start replay at.
+ * None of this is needed (or indeed helpful) for physical slots as
+ * they'll start replay at the last logged checkpoint anyway. Instead
+ * return the location of the last redo LSN. While that slightly
+ * increases the chance that we have to retry, it's where a base backup
+ * has to start replay at.
*/
+ if (SlotIsPhysical(slot))
+ restart_lsn = GetRedoRecPtr();
+ else if (RecoveryInProgress())
+ {
+ restart_lsn = GetXLogReplayRecPtr(NULL);
+ /*
+ * Replay pointer may point one past the end of the record. If that
+ * is an XLOG page boundary, it will not be a valid LSN for the
+ * start of a record, so bump it up past the page header.
+ */
+ if (!XRecOffIsValid(restart_lsn))
+ {
+ if (restart_lsn % XLOG_BLCKSZ != 0)
+ elog(ERROR, "invalid replay pointer");
+ /* For the first page of a segment file, it's a long header */
+ if (XLogSegmentOffset(restart_lsn, wal_segment_size) == 0)
+ restart_lsn += SizeOfXLogLongPHD;
+ else
+ restart_lsn += SizeOfXLogShortPHD;
+ }
+ }
+ else
+ restart_lsn = GetXLogInsertRecPtr();
+
+ SpinLockAcquire(&slot->mutex);
+ slot->data.restart_lsn = restart_lsn;
+ SpinLockRelease(&slot->mutex);
+
if (!RecoveryInProgress() && SlotIsLogical(slot))
{
XLogRecPtr flushptr;
- /* start at current insert position */
- restart_lsn = GetXLogInsertRecPtr();
- SpinLockAcquire(&slot->mutex);
- slot->data.restart_lsn = restart_lsn;
- SpinLockRelease(&slot->mutex);
-
/* make sure we have enough information to start */
flushptr = LogStandbySnapshot();
/* and make sure it's fsynced to disk */
XLogFlush(flushptr);
}
- else
- {
- restart_lsn = GetRedoRecPtr();
- SpinLockAcquire(&slot->mutex);
- slot->data.restart_lsn = restart_lsn;
- SpinLockRelease(&slot->mutex);
- }
/* prevent WAL removal as fast as possible */
ReplicationSlotsComputeRequiredLSN();
@@ -1065,6 +1144,122 @@ ReplicationSlotReserveWal(void)
}
/*
+ * Resolve recovery conflicts with logical slots.
+ *
+ * When xid is valid, it means that rows older than xid might have been
+ * removed. Therefore we need to drop slots that depend on seeing those rows.
+ * When xid is invalid, drop all logical slots. This is required when the
+ * master's wal_level is set back to replica, in which case all existing
+ * logical slots need to be dropped. In that case the caller passes a common
+ * 'conflict_reason' to use in the error detail; otherwise conflict_reason is
+ * NULL and the detail is constructed from the xid value.
+ */
+void
+ResolveRecoveryConflictWithLogicalSlots(Oid dboid, TransactionId xid,
+ char *conflict_reason)
+{
+ int i;
+ bool found_conflict = false;
+
+ if (max_replication_slots <= 0)
+ return;
+
+restart:
+ if (found_conflict)
+ {
+ CHECK_FOR_INTERRUPTS();
+ found_conflict = false;
+ }
+
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+ for (i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationSlot *s;
+
+ s = &ReplicationSlotCtl->replication_slots[i];
+
+ /* cannot change while ReplicationSlotControlLock is held */
+ if (!s->in_use)
+ continue;
+
+ /* We are only dealing with *logical* slot conflicts. */
+ if (!SlotIsLogical(s))
+ continue;
+
+ /* Invalid xid means caller is asking to drop all logical slots */
+ if (!TransactionIdIsValid(xid))
+ found_conflict = true;
+ else
+ {
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;
+ StringInfoData conflict_str, conflict_xmins;
+ char *conflict_sentence =
+ gettext_noop("Slot conflicted with xid horizon which was being increased to");
+
+ /* not our database, skip */
+ if (s->data.database != InvalidOid && s->data.database != dboid)
+ continue;
+
+ SpinLockAcquire(&s->mutex);
+ slot_xmin = s->data.xmin;
+ slot_catalog_xmin = s->data.catalog_xmin;
+ SpinLockRelease(&s->mutex);
+
+ /*
+ * Build the conflict_str which will look like :
+ * "Slot conflicted with xid horizon which was being increased
+ * to 9012 (slot xmin: 1234, slot catalog_xmin: 5678)."
+ */
+ initStringInfo(&conflict_xmins);
+ if (TransactionIdIsValid(slot_xmin) &&
+ TransactionIdPrecedesOrEquals(slot_xmin, xid))
+ {
+ appendStringInfo(&conflict_xmins, "slot xmin: %d", slot_xmin);
+ }
+ if (TransactionIdIsValid(slot_catalog_xmin) &&
+ TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
+ appendStringInfo(&conflict_xmins, "%sslot catalog_xmin: %d",
+ conflict_xmins.len > 0 ? ", " : "",
+ slot_catalog_xmin);
+
+ if (conflict_xmins.len > 0)
+ {
+ initStringInfo(&conflict_str);
+ appendStringInfo(&conflict_str, "%s %u (%s).",
+ conflict_sentence, xid, conflict_xmins.data);
+ found_conflict = true;
+ conflict_reason = conflict_str.data;
+ }
+ }
+
+ if (found_conflict)
+ {
+ NameData slotname;
+
+ SpinLockAcquire(&s->mutex);
+ slotname = s->data.name;
+ SpinLockRelease(&s->mutex);
+
+ /* ReplicationSlotDropConflicting() will acquire the lock below */
+ LWLockRelease(ReplicationSlotControlLock);
+
+ ReplicationSlotDropConflicting(s);
+
+ ereport(LOG,
+ (errmsg("dropped conflicting slot %s", NameStr(slotname)),
+ errdetail("%s", conflict_reason)));
+
+ /* We released the lock above; so re-scan the slots. */
+ goto restart;
+ }
+ }
+
+ LWLockRelease(ReplicationSlotControlLock);
+}
+
+
+/*
* Flush all replication slots to disk.
*
* This needn't actually be part of a checkpoint, but it's a convenient
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e7a59b0..a45098c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2814,6 +2814,7 @@ XLogSendLogical(void)
{
XLogRecord *record;
char *errm;
+ XLogRecPtr flushPtr;
/*
* Don't know whether we've caught up yet. We'll set WalSndCaughtUp to
@@ -2830,10 +2831,11 @@ XLogSendLogical(void)
if (errm != NULL)
elog(ERROR, "%s", errm);
+ flushPtr = (am_cascading_walsender ?
+ GetStandbyFlushRecPtr() : GetFlushRecPtr());
+
if (record != NULL)
{
- /* XXX: Note that logical decoding cannot be used while in recovery */
- XLogRecPtr flushPtr = GetFlushRecPtr();
/*
* Note the lack of any call to LagTrackerWrite() which is handled by
@@ -2857,7 +2859,7 @@ XLogSendLogical(void)
* If the record we just wanted read is at or beyond the flushed
* point, then we're caught up.
*/
- if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
+ if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
{
WalSndCaughtUp = true;
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index ea02973..09c827b 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2670,6 +2670,10 @@ CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
GET_VXID_FROM_PGPROC(procvxid, *proc);
+ /*
+ * Note: vxid.localTransactionId can be invalid, which means the
+ * request is to signal the pid that is not running a transaction.
+ */
if (procvxid.backendId == vxid.backendId &&
procvxid.localTransactionId == vxid.localTransactionId)
{
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 7605b2c..645f320 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -286,6 +286,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT))
RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+ if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT))
+ RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT);
+
if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK))
RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 25b7e31..7cfb6d5 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -23,6 +23,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/slot.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -291,7 +292,8 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
}
void
-ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode node)
+ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
+ bool onCatalogTable, RelFileNode node)
{
VirtualTransactionId *backends;
@@ -312,6 +314,9 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode
ResolveRecoveryConflictWithVirtualXIDs(backends,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+
+ if (onCatalogTable)
+ ResolveRecoveryConflictWithLogicalSlots(node.dbNode, latestRemovedXid, NULL);
}
void
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 44a59e1..c23d361 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2393,6 +2393,9 @@ errdetail_recovery_conflict(void)
case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
errdetail("User query might have needed to see row versions that must be removed.");
break;
+ case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
+ errdetail("User was using the logical slot that must be dropped.");
+ break;
case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
errdetail("User transaction caused buffer deadlock with recovery.");
break;
@@ -2879,6 +2882,25 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
case PROCSIG_RECOVERY_CONFLICT_LOCK:
case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+ case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
+ /*
+ * For conflicts that require a logical slot to be dropped, the
+ * signal receiver needs to release the slot so that it can be
+ * dropped by the signal sender. So for normal backends, the
+ * transaction should be aborted, just like for other recovery
+ * conflicts. But if it's a walsender on a standby, it has to be
+ * killed so that it releases the acquired logical slot.
+ */
+ if (am_cascading_walsender &&
+ reason == PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT &&
+ MyReplicationSlot && SlotIsLogical(MyReplicationSlot))
+ {
+ RecoveryConflictPending = true;
+ QueryCancelPending = true;
+ InterruptPending = true;
+ break;
+ }
/*
* If we aren't in a transaction any longer then ignore.
@@ -2920,7 +2942,6 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
/* Intentional fall through to session cancel */
/* FALLTHROUGH */
-
case PROCSIG_RECOVERY_CONFLICT_DATABASE:
RecoveryConflictPending = true;
ProcDiePending = true;
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 05240bf..547f9ab 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1456,6 +1456,21 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_db_conflict_logicalslot(PG_FUNCTION_ARGS)
+{
+ Oid dbid = PG_GETARG_OID(0);
+ int64 result;
+ PgStat_StatDBEntry *dbentry;
+
+ if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+ result = 0;
+ else
+ result = (int64) (dbentry->n_conflict_logicalslot);
+
+ PG_RETURN_INT64(result);
+}
+
+Datum
pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
{
Oid dbid = PG_GETARG_OID(0);
@@ -1499,6 +1514,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
dbentry->n_conflict_tablespace +
dbentry->n_conflict_lock +
dbentry->n_conflict_snapshot +
+ dbentry->n_conflict_logicalslot +
dbentry->n_conflict_bufferpin +
dbentry->n_conflict_startup_deadlock);
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index c13c08a..bd35bc1 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -18,7 +18,9 @@
#include "access/hash.h"
#include "access/htup_details.h"
#include "access/nbtree.h"
+#include "access/table.h"
#include "bootstrap/bootstrap.h"
+#include "catalog/catalog.h"
#include "catalog/namespace.h"
#include "catalog/pg_am.h"
#include "catalog/pg_amop.h"
@@ -1893,6 +1895,20 @@ get_rel_persistence(Oid relid)
return result;
}
+bool
+get_rel_logical_catalog(Oid relid)
+{
+ bool res;
+ Relation rel;
+
+ /* assume previously locked */
+ rel = table_open(relid, NoLock);
+ res = RelationIsAccessibleInLogicalDecoding(rel);
+ table_close(rel, NoLock);
+
+ return res;
+}
+
/* ---------- TRANSFORM CACHE ---------- */
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index f80694b..f772488 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -429,8 +429,8 @@ extern XLogRecPtr gistXLogPageDelete(Buffer buffer,
TransactionId xid, Buffer parentBuffer,
OffsetNumber downlinkOffset);
-extern void gistXLogPageReuse(Relation rel, BlockNumber blkno,
- TransactionId latestRemovedXid);
+extern void gistXLogPageReuse(Relation heapRel, Relation rel,
+ BlockNumber blkno, TransactionId latestRemovedXid);
extern XLogRecPtr gistXLogUpdate(Buffer buffer,
OffsetNumber *todelete, int ntodelete,
@@ -468,7 +468,7 @@ extern bool gistproperty(Oid index_oid, int attno,
extern bool gistfitpage(IndexTuple *itvec, int len);
extern bool gistnospace(Page page, IndexTuple *itvec, int len, OffsetNumber todelete, Size freespace);
extern void gistcheckpage(Relation rel, Buffer buf);
-extern Buffer gistNewBuffer(Relation r);
+extern Buffer gistNewBuffer(Relation heapRel, Relation r);
extern bool gistPageRecyclable(Page page);
extern void gistfillbuffer(Page page, IndexTuple *itup, int len,
OffsetNumber off);
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 969a537..59246c3 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -48,9 +48,9 @@ typedef struct gistxlogPageUpdate
*/
typedef struct gistxlogDelete
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 ntodelete; /* number of deleted offsets */
-
/*
* In payload of blk 0 : todelete OffsetNumbers
*/
@@ -96,6 +96,7 @@ typedef struct gistxlogPageDelete
*/
typedef struct gistxlogPageReuse
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 53b682c..fd70b55 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -263,6 +263,7 @@ typedef struct xl_hash_init_bitmap_page
*/
typedef struct xl_hash_vacuum_one_page
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int ntuples;
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index f6cdca8..a1d1f11 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -237,6 +237,7 @@ typedef struct xl_heap_update
*/
typedef struct xl_heap_clean
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
uint16 nredirected;
uint16 ndead;
@@ -252,6 +253,7 @@ typedef struct xl_heap_clean
*/
typedef struct xl_heap_cleanup_info
{
+ bool onCatalogTable;
RelFileNode node;
TransactionId latestRemovedXid;
} xl_heap_cleanup_info;
@@ -332,6 +334,7 @@ typedef struct xl_heap_freeze_tuple
*/
typedef struct xl_heap_freeze_page
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint16 ntuples;
} xl_heap_freeze_page;
@@ -346,6 +349,7 @@ typedef struct xl_heap_freeze_page
*/
typedef struct xl_heap_visible
{
+ bool onCatalogTable;
TransactionId cutoff_xid;
uint8 flags;
} xl_heap_visible;
@@ -395,7 +399,7 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
extern const char *heap2_identify(uint8 info);
extern void heap_xlog_logical_rewrite(XLogReaderState *r);
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
+extern XLogRecPtr log_heap_cleanup_info(Relation rel,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
@@ -414,7 +418,7 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
bool *totally_frozen);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
-extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
+extern XLogRecPtr log_heap_visible(Relation rel, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc8..f64a33c 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -126,6 +126,7 @@ typedef struct xl_btree_split
*/
typedef struct xl_btree_delete
{
+ bool onCatalogTable;
TransactionId latestRemovedXid;
int nitems;
@@ -139,6 +140,7 @@ typedef struct xl_btree_delete
*/
typedef struct xl_btree_reuse_page
{
+ bool onCatalogTable;
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
diff --git a/src/include/access/spgxlog.h b/src/include/access/spgxlog.h
index 073f740..d3dad69 100644
--- a/src/include/access/spgxlog.h
+++ b/src/include/access/spgxlog.h
@@ -237,6 +237,7 @@ typedef struct spgxlogVacuumRoot
typedef struct spgxlogVacuumRedirect
{
+ bool onCatalogTable;
uint16 nToPlaceholder; /* number of redirects to make placeholders */
OffsetNumber firstPlaceholder; /* first placeholder tuple to remove */
TransactionId newestRedirectXid; /* newest XID of removed redirects */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d519252..72c8d33 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -299,6 +299,7 @@ extern Size XLOGShmemSize(void);
extern void XLOGShmemInit(void);
extern void BootStrapXLOG(void);
extern void LocalProcessControlFile(bool reset);
+extern WalLevel GetActiveWalLevel(void);
extern void StartupXLOG(void);
extern void ShutdownXLOG(int code, Datum arg);
extern void InitXLOGAccess(void);
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 604470c..81bbfcb 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5259,6 +5259,11 @@
proname => 'pg_stat_get_db_conflict_snapshot', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => 'oid',
prosrc => 'pg_stat_get_db_conflict_snapshot' },
+{ oid => '3432',
+ descr => 'statistics: recovery conflicts in database caused by logical replication slot',
+ proname => 'pg_stat_get_db_conflict_logicalslot', provolatile => 's',
+ proparallel => 'r', prorettype => 'int8', proargtypes => 'oid',
+ prosrc => 'pg_stat_get_db_conflict_logicalslot' },
{ oid => '3068',
descr => 'statistics: recovery conflicts in database caused by shared buffer pin',
proname => 'pg_stat_get_db_conflict_bufferpin', provolatile => 's',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0a3ad3a..4fe8684 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -604,6 +604,7 @@ typedef struct PgStat_StatDBEntry
PgStat_Counter n_conflict_tablespace;
PgStat_Counter n_conflict_lock;
PgStat_Counter n_conflict_snapshot;
+ PgStat_Counter n_conflict_logicalslot;
PgStat_Counter n_conflict_bufferpin;
PgStat_Counter n_conflict_startup_deadlock;
PgStat_Counter n_temp_files;
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 8fbddea..73b954e 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -205,4 +205,6 @@ extern void CheckPointReplicationSlots(void);
extern void CheckSlotRequirements(void);
+extern void ResolveRecoveryConflictWithLogicalSlots(Oid dboid, TransactionId xid, char *reason);
+
#endif /* SLOT_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 05b186a..956d3c2 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -39,6 +39,7 @@ typedef enum
PROCSIG_RECOVERY_CONFLICT_TABLESPACE,
PROCSIG_RECOVERY_CONFLICT_LOCK,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
+ PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT,
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index a3f8f82..6dedebc 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -28,7 +28,7 @@ extern void InitRecoveryTransactionEnvironment(void);
extern void ShutdownRecoveryTransactionEnvironment(void);
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
- RelFileNode node);
+ bool onCatalogTable, RelFileNode node);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index c8df5bf..579d9ff 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -131,6 +131,7 @@ extern char get_rel_relkind(Oid relid);
extern bool get_rel_relispartition(Oid relid);
extern Oid get_rel_tablespace(Oid relid);
extern char get_rel_persistence(Oid relid);
+extern bool get_rel_logical_catalog(Oid relid);
extern Oid get_transform_fromsql(Oid typid, Oid langid, List *trftypes);
extern Oid get_transform_tosql(Oid typid, Oid langid, List *trftypes);
extern bool get_typisdefined(Oid typid);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d35b4a5..2243236 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -16,6 +16,7 @@
#include "access/tupdesc.h"
#include "access/xlog.h"
+#include "catalog/catalog.h"
#include "catalog/pg_class.h"
#include "catalog/pg_index.h"
#include "catalog/pg_publication.h"
@@ -309,6 +310,9 @@ typedef struct StdRdOptions
* RelationIsUsedAsCatalogTable
* Returns whether the relation should be treated as a catalog table
* from the pov of logical decoding. Note multiple eval of argument!
+ * This definition should not invoke anything that performs catalog
+ * access; otherwise it may cause infinite recursion. Check the comments
+ * in RelationIsAccessibleInLogicalDecoding() for details.
*/
#define RelationIsUsedAsCatalogTable(relation) \
((relation)->rd_options && \
@@ -566,6 +570,11 @@ typedef struct ViewOptions
* RelationIsAccessibleInLogicalDecoding
* True if we need to log enough information to have access via
* decoding snapshot.
+ * This definition should not invoke anything that performs catalog
+ * access. Otherwise, e.g. logging a WAL entry for a catalog relation may
+ * invoke this function, which will in turn do catalog access, which may
+ * in turn cause another similar WAL entry to be logged, leading to
+ * infinite recursion.
*/
#define RelationIsAccessibleInLogicalDecoding(relation) \
(XLogLogicalInfoActive() && \
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 6019f37..719837d 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -2000,6 +2000,33 @@ sub pg_recvlogical_upto
=pod
+=item $node->create_logical_slot_on_standby(self, master, slot_name, dbname)
+
+Create logical replication slot on given standby
+
+=cut
+
+sub create_logical_slot_on_standby
+{
+ my ($self, $master, $slot_name, $dbname) = @_;
+ my ($stdout, $stderr);
+
+ my $handle;
+
+ $handle = IPC::Run::start(['pg_recvlogical', '-d', $self->connstr($dbname), '-P', 'test_decoding', '-S', $slot_name, '--create-slot'], '>', \$stdout, '2>', \$stderr);
+ sleep(1);
+
+ # Slot creation on standby waits for an xl_running_xacts record. So arrange
+ # for it.
+ $master->safe_psql('postgres', 'CHECKPOINT');
+
+ $handle->finish();
+
+ return 0;
+}
+
+=pod
+
=back
=cut
diff --git a/src/test/recovery/t/018_logical_decoding_on_replica.pl b/src/test/recovery/t/018_logical_decoding_on_replica.pl
new file mode 100644
index 0000000..fd77e19
--- /dev/null
+++ b/src/test/recovery/t/018_logical_decoding_on_replica.pl
@@ -0,0 +1,420 @@
+# Test logical decoding on a standby: create logical slots on a standby,
+# decode changes streamed from the primary, and verify that conflicting
+# slots are dropped on recovery conflicts.
+#
+use strict;
+use warnings;
+use 5.8.0;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 58;
+use RecursiveCopy;
+use File::Copy;
+use Time::HiRes qw(usleep);
+
+my ($stdin, $stdout, $stderr, $ret, $handle, $return);
+my $backup_name;
+
+my $node_master = get_new_node('master');
+my $node_replica = get_new_node('replica');
+
+# Fetch xmin columns from slot's pg_replication_slots row, after waiting for
+# given boolean condition to be true to ensure we've reached a quiescent state
+sub wait_for_xmins
+{
+ my ($node, $slotname, $check_expr) = @_;
+
+ $node->poll_query_until(
+ 'postgres', qq[
+ SELECT $check_expr
+ FROM pg_catalog.pg_replication_slots
+ WHERE slot_name = '$slotname';
+ ]) or die "Timed out waiting for slot xmins to advance";
+
+ my $slotinfo = $node->slot($slotname);
+ return ($slotinfo->{'xmin'}, $slotinfo->{'catalog_xmin'});
+}
+
+sub print_phys_xmin
+{
+ my $slot = $node_master->slot('master_physical');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+sub print_logical_xmin
+{
+ my $slot = $node_replica->slot('standby_logical');
+ return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
+}
+
+sub create_logical_slots
+{
+ is($node_replica->create_logical_slot_on_standby($node_master, 'dropslot', 'testdb'),
+ 0, 'created dropslot on testdb')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+ is($node_replica->slot('dropslot')->{'slot_type'}, 'logical', 'dropslot on standby created');
+ is($node_replica->create_logical_slot_on_standby($node_master, 'activeslot', 'testdb'),
+ 0, 'created activeslot on testdb')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+ is($node_replica->slot('activeslot')->{'slot_type'}, 'logical', 'activeslot on standby created');
+
+ return 0;
+}
+
+sub make_slot_active
+{
+ # make sure activeslot is in use
+ print "starting pg_recvlogical";
+ $handle = IPC::Run::start(['pg_recvlogical', '-d', $node_replica->connstr('testdb'), '-S', 'activeslot', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);
+
+ while (!$node_replica->slot('activeslot')->{'active_pid'})
+ {
+ usleep(100_000);
+ print "waiting for slot to become active\n";
+ }
+ return 0;
+}
+
+sub check_slots_dropped
+{
+ is($node_replica->slot('dropslot')->{'slot_type'}, '', 'dropslot on standby dropped');
+ is($node_replica->slot('activeslot')->{'slot_type'}, '', 'activeslot on standby dropped');
+
+ # our client should've terminated in response to the walsender error
+ eval {
+ $handle->finish;
+ };
+ $return = $?;
+ cmp_ok($return, "!=", 0, "pg_recvlogical exited non-zero ");
+ if ($return) {
+ like($stderr, qr/conflict with recovery/, 'recvlogical recovery conflict');
+ like($stderr, qr/must be dropped/, 'recvlogical error detail');
+ }
+
+ return 0;
+}
+
+# Initialize master node
+$node_master->init(allows_streaming => 1, has_archiving => 1);
+$node_master->append_conf('postgresql.conf', q{
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+log_min_messages = 'debug2'
+log_error_verbosity = verbose
+# send status rapidly so we promptly advance xmin on master
+wal_receiver_status_interval = 1
+# very promptly terminate conflicting backends
+max_standby_streaming_delay = '2s'
+});
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql('postgres', q[CREATE DATABASE testdb]);
+
+$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('master_physical');]);
+$backup_name = 'b1';
+my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
+TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--slot=master_physical');
+
+my ($xmin, $catalog_xmin) = print_phys_xmin();
+# After slot creation, xmins must be null
+is($xmin, '', "xmin null");
+is($catalog_xmin, '', "catalog_xmin null");
+
+# Initialize slave node
+$node_replica->init_from_backup(
+ $node_master, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_replica->append_conf('postgresql.conf',
+ q[primary_slot_name = 'master_physical']);
+
+$node_replica->start;
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# with hot_standby_feedback off, xmin and catalog_xmin must still be null
+($xmin, $catalog_xmin) = print_phys_xmin();
+is($xmin, '', "xmin null after replica join");
+is($catalog_xmin, '', "catalog_xmin null after replica join");
+
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. With hot_standby_feedback on, xmin should advance,
+# but catalog_xmin should still remain NULL since there is no logical slot.
+($xmin, $catalog_xmin) = wait_for_xmins($node_master, 'master_physical',
+ "xmin IS NOT NULL AND catalog_xmin IS NULL");
+
+# Create new slots on the replica, ignoring the ones on the master completely.
+#
+# This must succeed since we know we have a catalog_xmin reservation. We
+# might've already sent hot standby feedback to advance our physical slot's
+# catalog_xmin but not received the corresponding xlog for the catalog xmin
+# advance, in which case we'll create a slot that isn't usable. The calling
+# application can prevent this by creating a temporary slot on the master to
+# lock in its catalog_xmin. For a truly race-free solution we'd need
+# master-to-standby hot_standby_feedback replies.
+#
+# In this case it won't race because there's no concurrent activity on the
+# master.
+#
+is($node_replica->create_logical_slot_on_standby($node_master, 'standby_logical', 'testdb'),
+ 0, 'logical slot creation on standby succeeded')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+($xmin, $catalog_xmin) = print_logical_xmin();
+is($xmin, '', "logical xmin null");
+isnt($catalog_xmin, '', "logical catalog_xmin not null");
+
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+$node_master->safe_psql('testdb', q[INSERT INTO test_table(blah) values ('itworks')]);
+$node_master->safe_psql('testdb', 'DROP TABLE test_table');
+$node_master->safe_psql('testdb', 'VACUUM');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+($xmin, $catalog_xmin) = print_phys_xmin();
+isnt($xmin, '', "physical xmin not null");
+isnt($catalog_xmin, '', "physical catalog_xmin not null");
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# Should show the inserts even when the table is dropped on master
+($ret, $stdout, $stderr) = $node_replica->psql('testdb', qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($stderr, '', 'stderr is empty');
+is($ret, 0, 'replay from slot succeeded')
+ or BAIL_OUT('cannot continue if slot replay fails');
+is($stdout, q{BEGIN
+table public.test_table: INSERT: id[integer]:1 blah[text]:'itworks'
+COMMIT}, 'replay results match');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+my ($physical_xmin, $physical_catalog_xmin) = print_phys_xmin();
+isnt($physical_xmin, '', "physical xmin not null");
+isnt($physical_catalog_xmin, '', "physical catalog_xmin not null");
+
+my ($logical_xmin, $logical_catalog_xmin) = print_logical_xmin();
+is($logical_xmin, '', "logical xmin null");
+isnt($logical_catalog_xmin, '', "logical catalog_xmin not null");
+
+# Ok, do a pile of tx's and make sure xmin advances.
+# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
+# we hold down xmin.
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_1();]);
+$node_master->safe_psql('testdb', 'CREATE TABLE test_table(id serial primary key, blah text)');
+for my $i (0 .. 2000)
+{
+ $node_master->safe_psql('testdb', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
+}
+$node_master->safe_psql('testdb', qq[CREATE TABLE catalog_increase_2();]);
+$node_master->safe_psql('testdb', 'VACUUM');
+
+my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+cmp_ok($new_logical_catalog_xmin, "==", $logical_catalog_xmin,
+ "logical slot catalog_xmin hasn't advanced before get_changes");
+
+($ret, $stdout, $stderr) = $node_replica->psql('testdb',
+ qq[SELECT data FROM pg_logical_slot_get_changes('standby_logical', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-timestamp', '0')]);
+is($ret, 0, 'replay of big series succeeded');
+isnt($stdout, '', 'replayed some rows');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+# logical slot catalog_xmin on slave should advance after
+# pg_logical_slot_get_changes
+($new_logical_xmin, $new_logical_catalog_xmin) =
+ wait_for_xmins($node_replica, 'standby_logical',
+ "catalog_xmin::varchar::int > ${logical_catalog_xmin}");
+is($new_logical_xmin, '', "logical xmin null");
+
+# hot standby feedback should advance master's phys catalog_xmin now that the
+# standby's slot doesn't hold it down as far.
+my ($new_physical_xmin, $new_physical_catalog_xmin) =
+ wait_for_xmins($node_master, 'master_physical',
+ "catalog_xmin::varchar::int > ${physical_catalog_xmin}");
+isnt($new_physical_xmin, '', "physical xmin not null");
+cmp_ok($new_physical_catalog_xmin, "<=", $new_logical_catalog_xmin,
+ 'upstream physical slot catalog_xmin not past downstream catalog_xmin with hs_feedback on');
+
+#########################################################
+# Upstream oldestXid retention
+#########################################################
+
+sub test_oldest_xid_retention()
+{
+ # First burn some xids on the master in another DB, so we push the master's
+ # nextXid ahead.
+ foreach my $i (1 .. 100)
+ {
+ $node_master->safe_psql('postgres', 'SELECT txid_current()');
+ }
+
+ # Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
+ # past our needed xmin. The only way we have visibility into that is to force
+ # a checkpoint.
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
+ foreach my $dbname ('template1', 'postgres', 'testdb', 'template0')
+ {
+ $node_master->safe_psql($dbname, 'VACUUM FREEZE');
+ }
+ sleep(1);
+ $node_master->safe_psql('postgres', 'CHECKPOINT');
+ IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
+ or die "pg_controldata failed with $?";
+ my @checkpoint = split('\n', $stdout);
+ my ($oldestXid, $nextXid) = ('', '');
+ foreach my $line (@checkpoint)
+ {
+ if ($line =~ qr/^Latest checkpoint's NextXID:\s+\d+:(\d+)/)
+ {
+ $nextXid = $1;
+ }
+ if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
+ {
+ $oldestXid = $1;
+ }
+ }
+ die 'no oldestXID found in checkpoint' unless $oldestXid;
+
+ my ($new_physical_xmin, $new_physical_catalog_xmin) = print_phys_xmin();
+ my ($new_logical_xmin, $new_logical_catalog_xmin) = print_logical_xmin();
+
+ print "upstream oldestXid $oldestXid, nextXid $nextXid, phys slot catalog_xmin $new_physical_catalog_xmin, downstream catalog_xmin $new_logical_catalog_xmin";
+
+ $node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
+
+ return ($oldestXid);
+}
+
+my ($oldestXid) = test_oldest_xid_retention();
+
+cmp_ok($oldestXid, "<=", $new_logical_catalog_xmin,
+ 'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
+
+##################################################
+# Drop slot
+##################################################
+#
+is($node_replica->safe_psql('postgres', 'SHOW hot_standby_feedback'), 'on', 'hs_feedback is on');
+
+# Make sure slots on replicas are droppable, and properly clear the upstream's xmin
+$node_replica->psql('testdb', q[SELECT pg_drop_replication_slot('standby_logical')]);
+
+is($node_replica->slot('standby_logical')->{'slot_type'}, '', 'slot on standby dropped manually');
+
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. catalog_xmin should become NULL because we dropped
+# the logical slot.
+($xmin, $catalog_xmin) = wait_for_xmins($node_master, 'master_physical',
+ "xmin IS NOT NULL AND catalog_xmin IS NULL");
+
+##################################################
+# Recovery conflict: Drop conflicting slots, including in-use slots
+# Scenario 1 : hot_standby_feedback off
+##################################################
+
+create_logical_slots();
+
+# One way to reproduce recovery conflict is to run VACUUM FULL with
+# hot_standby_feedback turned off on slave.
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = off
+]);
+$node_replica->restart;
+# ensure walreceiver feedback off by waiting for expected xmin and
+# catalog_xmin on master. Both should be NULL since hs_feedback is off
+($xmin, $catalog_xmin) = wait_for_xmins($node_master, 'master_physical',
+ "xmin IS NULL AND catalog_xmin IS NULL");
+
+make_slot_active();
+
+# This should trigger the conflict
+$node_master->safe_psql('testdb', 'VACUUM FULL');
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+check_slots_dropped();
+
+# Turn hot_standby_feedback back on
+$node_replica->append_conf('postgresql.conf',q[
+hot_standby_feedback = on
+]);
+$node_replica->restart;
+
+# ensure walreceiver feedback sent by waiting for expected xmin and
+# catalog_xmin on master. With hot_standby_feedback on, xmin should advance,
+# but catalog_xmin should still remain NULL since there is no logical slot.
+($xmin, $catalog_xmin) = wait_for_xmins($node_master, 'master_physical',
+ "xmin IS NOT NULL AND catalog_xmin IS NULL");
+
+##################################################
+# Recovery conflict: Drop conflicting slots, including in-use slots
+# Scenario 2 : incorrect wal_level at master
+##################################################
+
+create_logical_slots();
+
+make_slot_active();
+
+# Make master wal_level replica. This will trigger slot conflict.
+$node_master->append_conf('postgresql.conf',q[
+wal_level = 'replica'
+]);
+$node_master->restart;
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+check_slots_dropped();
+
+# Restore master wal_level
+$node_master->append_conf('postgresql.conf',q[
+wal_level = 'logical'
+]);
+$node_master->restart;
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+##################################################
+# Recovery: drop database drops slots, including active slots.
+##################################################
+
+# Create a couple of slots on the DB to ensure they are dropped when we drop
+# the DB.
+create_logical_slots();
+
+make_slot_active();
+
+# Create a slot on a database that would not be dropped. This slot should not
+# get dropped.
+is($node_replica->create_logical_slot_on_standby($node_master, 'otherslot', 'postgres'),
+ 0, 'created otherslot on postgres')
+ or BAIL_OUT('cannot continue if slot creation fails, see logs');
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical', 'otherslot on standby created');
+
+# dropdb on the master to verify slots are dropped on standby
+$node_master->safe_psql('postgres', q[DROP DATABASE testdb]);
+
+$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
+
+is($node_replica->safe_psql('postgres',
+ q[SELECT EXISTS(SELECT 1 FROM pg_database WHERE datname = 'testdb')]), 'f',
+ 'database dropped on standby');
+
+check_slots_dropped();
+
+is($node_replica->slot('otherslot')->{'slot_type'}, 'logical',
+ 'otherslot on standby not dropped');
+
+# Cleanup : manually drop the slot that was not dropped.
+$node_replica->psql('postgres', q[SELECT pg_drop_replication_slot('otherslot')]);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 210e9cd..1a049a4 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1838,6 +1838,7 @@ pg_stat_database_conflicts| SELECT d.oid AS datid,
pg_stat_get_db_conflict_tablespace(d.oid) AS confl_tablespace,
pg_stat_get_db_conflict_lock(d.oid) AS confl_lock,
pg_stat_get_db_conflict_snapshot(d.oid) AS confl_snapshot,
+ pg_stat_get_db_conflict_logicalslot(d.oid) AS confl_logicalslot,
pg_stat_get_db_conflict_bufferpin(d.oid) AS confl_bufferpin,
pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock
FROM pg_database d;
--
2.1.4
On 07/10/2019 05:12 PM, Amit Khandekar wrote:
All right. Will do that in the next patch set. For now, I have quickly
done the below changes in a single patch again (attached), in order to
get early comments if any.
Thanks Amit for your patch. I am able to see one issue on the standby server
(where the logical replication slot was created):
a) the size of the pg_wal folder is NOT decreasing even after calling the
get_changes function
b) pg_wal files are not being recycled; every time, new files are created
after calling the get_changes function
Here are the detailed steps -
create a directory with the name 'archive_dir' under /tmp (mkdir
/tmp/archive_dir)
SR setup -
Master
.)Perform initdb (./initdb -D master --wal-segsize=2)
.)Open postgresql.conf file and add these below parameters at the end
of file
wal_level='logical'
min_wal_size=4MB
max_wal_size=4MB
hot_standby_feedback = on
archive_mode=on
archive_command='cp %p /tmp/archive_dir/%f'
.)Start the server ( ./pg_ctl -D master/ start -l logsM -c )
.)Connect to psql , create physical slot
->SELECT * FROM
pg_create_physical_replication_slot('decoding_standby');
Standby -
.)Perform pg_basebackup ( ./pg_basebackup -D standby/
--slot=decoding_standby -R -v)
.)Open postgresql.conf file of standby and add these 2 parameters - at
the end of file
port=5555
primary_slot_name = 'decoding_standby'
.)Start the Standby server ( ./pg_ctl -D standby/ start -l logsS -c )
.)Connect to psql terminal and create logical replication slot
->SELECT * from pg_create_logical_replication_slot('standby',
'test_decoding');
MISC steps -
.)Connect to master and create table/insert rows ( create table t(n
int); insert into t values (1); )
.)Connect to standby and fire get_changes function ( select * from
pg_logical_slot_get_changes('standby',null,null); )
.)Run pgbench ( ./pgbench -i -s 10 postgres)
.)Check the pg_wal directory size of STANDBY
[centos@mail-arts bin]$ du -sch standby/pg_wal/
127M standby/pg_wal/
127M total
[centos@mail-arts bin]$
.)Connect to standby and fire get_changes function ( select * from
pg_logical_slot_get_changes('standby',null,null); )
.)Check the pg_wal directory size of STANDBY
[centos@mail-arts bin]$ du -sch standby/pg_wal/
127M standby/pg_wal/
127M total
[centos@mail-arts bin]$
.)Restart both master and standby ( ./pg_ctl -D master restart -l logsM
-c) and (./pg_ctl -D standby restart -l logsS -c )
.)Check the pg_wal directory size of STANDBY
[centos@mail-arts bin]$ du -sch standby/pg_wal/
127M standby/pg_wal/
127M total
[centos@mail-arts bin]$
and if we look at the pg_wal files, they keep growing and are not being reused.
--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
Hi,
On 2019-07-12 14:53:21 +0530, tushar wrote:
On 07/10/2019 05:12 PM, Amit Khandekar wrote:
All right. Will do that in the next patch set. For now, I have quickly
done the below changes in a single patch again (attached), in order to
get early comments if any.
Thanks Amit for your patch. I am able to see an issue on the standby server
(where the logical replication slot was created):
a) the size of the pg_wal folder is NOT decreasing even after firing the get_changes
function
Even after calling pg_logical_slot_get_changes() multiple times? What
does
SELECT * FROM pg_replication_slots; before and after multiple calls return?
Does manually forcing a checkpoint with CHECKPOINT; first on the primary
and then the standby "fix" the issue?
b) pg_wal files are not being recycled, and new files are created every time
after firing the get_changes function
I'm not sure what you mean by this. Are you saying that
pg_logical_slot_get_changes() causes WAL to be written?
Greetings,
Andres Freund
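For illustration, here is a rough TAP-style sketch of the checks Andres suggests; it is only a sketch, assuming an already-running $node_primary and $node_standby (PostgresNode objects, names not from the thread), the 'standby' slot from Tushar's steps, and the usual Test::More test context:
# Sketch only: inspect the slot before and after consuming changes and
# forcing checkpoints; restart_lsn is what pins WAL on the standby.
my $before = $node_standby->safe_psql('postgres',
	'SELECT slot_name, restart_lsn, catalog_xmin FROM pg_replication_slots');
for my $i (1 .. 3)
{
	$node_standby->safe_psql('postgres',
		q[SELECT count(*) FROM pg_logical_slot_get_changes('standby', NULL, NULL)]);
}
# Old WAL segments are only recycled/removed at a checkpoint (a restartpoint
# on the standby), so force one on the primary and then on the standby.
$node_primary->safe_psql('postgres', 'CHECKPOINT');
$node_standby->safe_psql('postgres', 'CHECKPOINT');
my $after = $node_standby->safe_psql('postgres',
	'SELECT slot_name, restart_lsn, catalog_xmin FROM pg_replication_slots');
diag("slots before:\n$before\nslots after:\n$after");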
On Tue, 16 Jul 2019 at 22:56, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-07-12 14:53:21 +0530, tushar wrote:
On 07/10/2019 05:12 PM, Amit Khandekar wrote:
All right. Will do that in the next patch set. For now, I have quickly
done the below changes in a single patch again (attached), in order to
get early comments if any.
Thanks Amit for your patch. I am able to see an issue on the standby server
(where the logical replication slot was created):
a) the size of the pg_wal folder is NOT decreasing even after firing the
get_changes function
Even after calling pg_logical_slot_get_changes() multiple times? What
does
SELECT * FROM pg_replication_slots; before and after multiple calls return?
Does manually forcing a checkpoint with CHECKPOINT; first on the primary
and then the standby "fix" the issue?
I independently tried to reproduce this issue on my machine yesterday.
I observed that:
Sometimes, the files get cleaned up after two or more calls to
pg_logical_slot_get_changes().
Sometimes, I have to restart the server to see the pg_wal files cleaned up.
This happens more or less the same even for a logical slot on the *primary*.
Will investigate further with Tushar.
b) pg_wal files are not being recycled, and new files are created every time
after firing the get_changes function
I'm not sure what you mean by this. Are you saying that
pg_logical_slot_get_changes() causes WAL to be written?
Greetings,
Andres Freund
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On 07/16/2019 10:56 PM, Andres Freund wrote:
Even after calling pg_logical_slot_get_changes() multiple times? What
does
SELECT * FROM pg_replication_slots; before and after multiple calls return?
Does manually forcing a checkpoint with CHECKPOINT; first on the primary
and then the standby "fix" the issue?
Yes, eventually it gets cleaned up - after firing the get_changes function
multiple times, or a checkpoint, or even both.
We are able to see this same behavior on MASTER - with or without the patch.
But is this an old (existing) issue?
b) pg_wal files are not being recycled, and new files are created every time
after firing the get_changes function
I'm not sure what you mean by this. Are you saying that
pg_logical_slot_get_changes() causes WAL to be written?
No, when I said new WAL files are created, I meant after each pgbench
run, NOT after executing get_changes.
--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
On Wed, 10 Jul 2019 at 17:12, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Wed, 10 Jul 2019 at 08:44, Andres Freund <andres@anarazel.de> wrote:
Hi,
Thanks for the new version! Looks like we're making progress towards
something committable here.
I think it'd be good to split the patch into a few pieces. I'd maybe do
that like:
1) WAL format changes (plus required other changes)
2) Recovery conflicts with slots
3) logical decoding on standby
4) tests
All right. Will do that in the next patch set. For now, I have quickly
done the below changes in a single patch again (attached), in order to
get early comments if any.
Attached are the split patches. Included is an additional patch that
has doc changes. Here is what I have added in the docs. Pasting it
here so that all can easily spot how it is supposed to behave, and to
confirm that we are all on the same page :
"A logical replication slot can also be created on a hot standby. To
prevent VACUUM from removing required rows from the system catalogs,
hot_standby_feedback should be set on the standby. In spite of that,
if any required rows get removed on standby, the slot gets dropped.
Existing logical slots on standby also get dropped if wal_level on
primary is reduced to less than 'logical'.
For a logical slot to be created, it builds a historic snapshot, for
which information of all the currently running transactions is
essential. On primary, this information is available, but on standby,
this information has to be obtained from primary. So, slot creation
may wait for some activity to happen on the primary. If the primary is
idle, creating a logical slot on standby may take a noticeable time."
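For concreteness, a minimal TAP-style sketch of the behaviour described in the proposed doc text above, assuming the patch set from this thread is applied; node, backup and slot names here are illustrative only and not taken from the patches:
use strict;
use warnings;
use PostgresNode;
# Primary with logical WAL level.
my $node_primary = get_new_node('primary');
$node_primary->init(allows_streaming => 1);
$node_primary->append_conf('postgresql.conf', "wal_level = 'logical'");
$node_primary->start;
# Standby from a base backup, with hot_standby_feedback on so that VACUUM on
# the primary does not remove catalog rows the slot still needs.
$node_primary->backup('bkp');
my $node_standby = get_new_node('standby');
$node_standby->init_from_backup($node_primary, 'bkp', has_streaming => 1);
$node_standby->append_conf('postgresql.conf', 'hot_standby_feedback = on');
$node_standby->start;
# Generate some activity on the primary (a checkpoint logs a running-xacts
# record); without it, slot creation below may wait while the primary is idle.
$node_primary->safe_psql('postgres', 'CHECKPOINT');
# Create the logical slot on the standby.
$node_standby->safe_psql('postgres',
	q[SELECT pg_create_logical_replication_slot('standby_slot', 'test_decoding')]);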
Attachments:
On 2019-Jul-19, Amit Khandekar wrote:
Attached are the split patches. Included is an additional patch that
has doc changes. Here is what I have added in the docs. Pasting it
here so that all can easily spot how it is supposed to behave, and to
confirm that we are all on the same page :
... Apparently, this patch was not added to the commitfest for some
reason; and another patch that *is* in the commitfest has been said to
depend on this one (Petr's https://commitfest.postgresql.org/24/1961/
which hasn't been updated in quite a while and has received no feedback
at all, not even from the listed reviewer Shaun Thomas). To make
matters worse, Amit's patchset no longer applies.
What I would like to do is add a link to this thread to CF's 1961 entry
and then update all these patches, in order to get things moving.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, 3 Sep 2019 at 23:10, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2019-Jul-19, Amit Khandekar wrote:
Attached are the split patches. Included is an additional patch that
has doc changes. Here is what I have added in the docs. Pasting it
here so that all can easily spot how it is supposed to behave, and to
confirm that we are all on the same page :
... Apparently, this patch was not added to the commitfest for some
reason; and another patch that *is* in the commitfest has been said to
depend on this one (Petr's https://commitfest.postgresql.org/24/1961/
which hasn't been updated in quite a while and has received no feedback
at all, not even from the listed reviewer Shaun Thomas). To make
matters worse, Amit's patchset no longer applies.
What I would like to do is add a link to this thread to CF's 1961 entry
and then update all these patches, in order to get things moving.
Hi Alvaro,
Thanks for notifying about this. Will work this week on rebasing this
patchset and putting it into the 2019-11 commit fest.
On Mon, 9 Sep 2019 at 16:06, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Tue, 3 Sep 2019 at 23:10, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2019-Jul-19, Amit Khandekar wrote:
Attached are the split patches. Included is an additional patch that
has doc changes. Here is what I have added in the docs. Pasting it
here so that all can easily spot how it is supposed to behave, and to
confirm that we are all on the same page :
... Apparently, this patch was not added to the commitfest for some
reason; and another patch that *is* in the commitfest has been said to
depend on this one (Petr's https://commitfest.postgresql.org/24/1961/
which hasn't been updated in quite a while and has received no feedback
at all, not even from the listed reviewer Shaun Thomas). To make
matters worse, Amit's patchset no longer applies.
What I would like to do is add a link to this thread to CF's 1961 entry
and then update all these patches, in order to get things moving.
Hi Alvaro,
Thanks for notifying about this. Will work this week on rebasing this
patchset and putting it into the 2019-11 commit fest.
Rebased patch set attached.
Added in the Nov commitfest : https://commitfest.postgresql.org/25/2283/
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
On Fri, Sep 13, 2019 at 7:20 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
Thanks for notifying about this. Will work this week on rebasing this
patchset and putting it into the 2019-11 commit fest.
Rebased patch set attached.
Added in the Nov commitfest : https://commitfest.postgresql.org/25/2283/
I took a bit of a look at
0004-New-TAP-test-for-logical-decoding-on-standby.patch and saw some
things I don't like in terms of general code quality:
- Not many comments. I think each set of tests should have a block
comment at the top explaining clearly what it's trying to test.
- print_phys_xmin and print_logical_xmin don't print anything.
- They are also identical to each other except that they each operate
on a different hard-coded slot name.
- They are also identical to wait_for_xmins except that they don't wait.
- create_logical_slots creates two slots whose names are hard-coded
using code that is cut-and-pasted.
- The same code is also cut-and-pasted into two other places in the file.
- Why does that cut-and-pasted code use BAIL_OUT(), which aborts the
entire test run, instead of die, which just aborts the current test
file?
- cmp_ok() message in check_slots_dropped() has trailing whitespace.
- make_slot_active() and check_slots_dropped(), at least, use global
variables; is that really necessary?
- In particular, $return is used only in one function and doesn't need
to survive across calls; why is it not a local variable?
- Depending on whether $return ends up true or false, the number of
executed tests will differ; so besides any actual test failures,
you'll get complaints about not executing exactly 58 tests.
- $backup_name only ever has one value, but for some reason the
variable is created at the top of the test file and then initialized
later. Just do my $backup_name = 'b1' near where it's first used, or
ditch the variable and write 'b1' in each of the three places it's
used.
- Some of the calls to wait_for_xmins() save the return values into
local variables but then do nothing with those values before they are
overwritten. Either it's wrong that we're saving them into local
variables, or it's wrong that we're not doing anything with them.
- test_oldest_xid_retention() is called only once; it basically acts
as a wrapper for one group of tests. You could argue against that
approach, but I actually think it's a nice style which makes the code
more self-documenting. However, it's not used consistently; all the
other groups of tests are written directly as toplevel code.
- The code in that function verifies that oldestXid is found in
pg_controldata's output, but does not check the same for NextXID.
- Is there a reason the code in that function prints debugging output?
Seems like a leftover.
- I think it might be an idea to move the tests for recovery
conflict/slot drop to a separate test file, so that we have one file
for the xmin-related testing and another for the recovery conflict
testing.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, 18 Sep 2019 at 19:34, Robert Haas <robertmhaas@gmail.com> wrote:
I took a bit of a look at
0004-New-TAP-test-for-logical-decoding-on-standby.patch and saw some
things I don't like in terms of general code quality:
- Not many comments. I think each set of tests should have a block
comment at the top explaining clearly what it's trying to test.
Done at initial couple of test groups so that the groups would be
spotted clearly. Please check.
- print_phys_xmin and print_logical_xmin don't print anything.
- They are also identical to each other except that they each operate
on a different hard-coded slot name.
- They are also identical to wait_for_xmins except that they don't wait.
Re-worked this part of the code. Now a single function
get_slot_xmins(slot_name) is used to return the slot's xmins. It
figures out from the slot name whether the slot belongs to the master or
the standby. Also, avoided the hardcoded 'master_physical' and
'standby_logical' names.
Removed 'node' parameter of wait_for_xmins(), since now we can figure
out node name from slot name.
- create_logical_slots creates two slots whose names are hard-coded
using code that is cut-and-pasted.
- The same code is also cut-and-pasted into two other places in the file.
Didn't remove the hardcoding for slot names, because it's not
convenient to return those from create_logical_slots() and use them in
check_slots_dropped(). But I have addressed the cut-and-pasted code in
create_logical_slots() and the other two places in the file: some of
that repeated code is now consolidated in create_logical_slots() itself.
- Why does that cut-and-pasted code use BAIL_OUT(), which aborts the
entire test run, instead of die, which just aborts the current test
file?
Oops. Didn't realize that it bails out from the complete test run.
Replaced it with die().
- cmp_ok() message in check_slots_dropped() has trailing whitespace.
Removed it.
- make_slot_active() and check_slots_dropped(), at least, use global
variables; is that really necessary?
I guess you are referring to $handle. Now made make_slot_active()
return this handle, which is then passed to check_slots_dropped().
Retained the node_replica global variable rather than passing it as a
function parameter, because these functions always use node_replica,
and never node_master.
- In particular, $return is used only in one function and doesn't need
to survive across calls; why is it not a local variable?
- Depending on whether $return ends up true or false, the number of
executed tests will differ; so besides any actual test failures,
you'll get complaints about not executing exactly 58 tests.
Right. Made it local.
- $backup_name only ever has one value, but for some reason the
variable is created at the top of the test file and then initialized
later. Just do my $backup_name = 'b1' near where it's first used, or
ditch the variable and write 'b1' in each of the three places it's
used.
Declared $backup_name near its first usage.
- Some of the calls to wait_for_xmins() save the return values into
local variables but then do nothing with those values before they are
overwritten. Either it's wrong that we're saving them into local
variables, or it's wrong that we're not doing anything with them.
Yeah, in many places it was redundant to save them into variables, so
I removed the return-value assignments at those places.
- test_oldest_xid_retention() is called only once; it basically acts
as a wrapper for one group of tests. You could argue against that
approach, but I actually think it's a nice style which makes the code
more self-documenting. However, it's not used consistently; all the
other groups of tests are written directly as toplevel code.
Removed the function and kept its code at the top level. I think the
test group header comments are sufficient for documenting each group
of tests, so there is no need to make a separate function for
each group.
- The code in that function verifies that oldestXid is found in
pg_controldata's output, but does not check the same for NextXID.
Actually, there is no need to check NextXID. We want to check just
oldestXid. Removed its usage.
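For illustration, here is a hedged sketch of how such an oldestXid check might look; the exact helper used in the patch may differ, and $node_master is assumed to be the usual PostgresNode object:
# Sketch: read oldestXID from pg_controldata output for the master's data dir.
my $datadir = $node_master->data_dir;
my $controldata = `pg_controldata "$datadir"`;
my ($oldest_xid) = $controldata =~ /^Latest checkpoint's oldestXID:\s*(\d+)/m;
die "oldestXID not found in pg_controldata output" unless defined $oldest_xid;
# ... $oldest_xid can then be compared against the slot's catalog_xmin ...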
- Is there a reason the code in that function prints debugging output?
Seems like a leftover.
Yeah, right. Removed them.
- I think it might be an idea to move the tests for recovery
conflict/slot drop to a separate test file, so that we have one file
for the xmin-related testing and another for the recovery conflict
testing.
Actually, in some of the conflict-recovery test cases I am still using
wait_for_xmins() so that we can check the xmin values again after we
drop the slots. So xmin-related testing is embedded in these recovery
tests as well. We could move the wait_for_xmins() function to some
common file and then split this file, but then effectively
some of the xmin testing would go into the recovery-related test file,
which did not sound sensible to me. What do you say?
Attached patch series has the test changes addressed.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
On Thu, Sep 26, 2019 at 5:14 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
Actually in some of the conflict-recovery testcases, I am still using
wait_for_xmins() so that we could test the xmin values back after we
drop the slots. So xmin-related testing is embedded in these recovery
tests as well. We can move the wait_for_xmins() function to some
common file and then do the split of this file, but then effectively
some of the xmin-testing would go into the recovery-related test file,
which did not sound sensible to me. What do you say ?
I agree we don't want code duplication, but I think we could reduce
the code duplication to a pretty small amount with a few cleanups.
I don't think wait_for_xmins() looks very well-designed. It goes to the
trouble of returning a value, but only 2 of the 6 call sites pay
attention to the returned value. I think we should change the
function so that it doesn't return anything and have the callers that
want a return value call get_slot_xmins() after wait_for_xmins().
And then I think we should turn around and get rid of get_slot_xmins()
altogether. Instead of:
my ($xmin, $catalog_xmin) = get_slot_xmins($master_slot);
is($xmin, '', "xmin null");
is($catalog_xmin, '', "catalog_xmin null");
We can write:
my $slot = $node_master->slot($master_slot);
is($slot->{'xmin'}, '', "xmin null");
is($slot->{'catalog_xmin'}, '', "catalog xmin null");
...which is not really any longer or harder to read, but does
eliminate the need for one function definition.
Then I think we should change wait_for_xmins so that it takes three
arguments rather than two: $node, $slotname, $check_expr. With that
and the previous change, we can get rid of get_node_from_slotname().
At that point, the body of wait_for_xmins() would consist of a single
call to $node->poll_query_until() or die(), which doesn't seem like
too much code to duplicate into a new file.
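For concreteness, a three-argument wait_for_xmins() along those lines might look roughly like this sketch (an illustration only, not the exact code from the patch set):
sub wait_for_xmins
{
	my ($node, $slotname, $check_expr) = @_;

	# poll_query_until() waits for the query to return 't' (true).
	$node->poll_query_until(
		'postgres', qq[
		SELECT $check_expr
		FROM pg_catalog.pg_replication_slots
		WHERE slot_name = '$slotname';
	]) or die "timed out waiting for slot xmins to satisfy: $check_expr";
}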
Looking at it a bit more, though, I wonder why the recovery
conflict scenario is even using wait_for_xmins(). It's hard-coded to
check the state of the master_physical slot, which isn't otherwise
manipulated by the recovery conflict tests. What's the point of
testing that a slot which had xmin and catalog_xmin NULL before the
test started (line 414) and which we haven't changed since still has
those values at two different points during the test (lines 432, 452)?
Perhaps I'm missing something here, but it seems like this is just an
inadvertent entangling of these scenarios with the previous scenarios,
rather than anything that necessarily needs to be connected together.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, 27 Sep 2019 at 01:57, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Sep 26, 2019 at 5:14 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
Actually in some of the conflict-recovery testcases, I am still using
wait_for_xmins() so that we could test the xmin values back after we
drop the slots. So xmin-related testing is embedded in these recovery
tests as well. We can move the wait_for_xmins() function to some
common file and then do the split of this file, but then effectively
some of the xmin-testing would go into the recovery-related test file,
which did not sound sensible to me. What do you say?
I agree we don't want code duplication, but I think we could reduce
the code duplication to a pretty small amount with a few cleanups.
I don't think wait_for_xmins() looks very well-designed. It goes to the
trouble of returning a value, but only 2 of the 6 call sites pay
attention to the returned value. I think we should change the
function so that it doesn't return anything and have the callers that
want a return value call get_slot_xmins() after wait_for_xmins().
Yeah, that can be done.
And then I think we should turn around and get rid of get_slot_xmins()
altogether. Instead of:
my ($xmin, $catalog_xmin) = get_slot_xmins($master_slot);
is($xmin, '', "xmin null");
is($catalog_xmin, '', "catalog_xmin null");
We can write:
my $slot = $node_master->slot($master_slot);
is($slot->{'xmin'}, '', "xmin null");
is($slot->{'catalog_xmin'}, '', "catalog xmin null");
...which is not really any longer or harder to read, but does
eliminate the need for one function definition.
Agreed.
Then I think we should change wait_for_xmins so that it takes three
arguments rather than two: $node, $slotname, $check_expr. With that
and the previous change, we can get rid of get_node_from_slotname().
At that point, the body of wait_for_xmins() would consist of a single
call to $node->poll_query_until() or die(), which doesn't seem like
too much code to duplicate into a new file.
Earlier it used to have 3 params, the same ones you mentioned. I
removed $node for caller convenience.
Looking at it a bit more, though, I wonder why the recovery
conflict scenario is even using wait_for_xmins(). It's hard-coded to
check the state of the master_physical slot, which isn't otherwise
manipulated by the recovery conflict tests. What's the point of
testing that a slot which had xmin and catalog_xmin NULL before the
test started (line 414) and which we haven't changed since still has
those values at two different points during the test (lines 432, 452)?
Perhaps I'm missing something here, but it seems like this is just an
inadvertent entangling of these scenarios with the previous scenarios,
rather than anything that necessarily needs to be connected together.
In the "Drop slot" test scenario, we are testing that after we
manually drop the slot on standby, the master catalog_xmin should be
back to NULL. Hence, the call to wait_for_xmins().
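As an illustration of that "Drop slot" check, the wait might look roughly like this sketch (slot and node names are placeholders, not necessarily those used in the test):
# Drop the logical slot on the standby, then wait until the physical slot on
# the master reports a NULL catalog_xmin again.
$node_standby->safe_psql('postgres',
	q[SELECT pg_drop_replication_slot('standby_logical')]);

$node_master->poll_query_until(
	'postgres', q[
	SELECT catalog_xmin IS NULL
	FROM pg_catalog.pg_replication_slots
	WHERE slot_name = 'master_physical';
]) or die "master_physical slot's catalog_xmin never became NULL";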
And in the "Scenario 1 : hot_standby_feedback off", wait_for_xmins()
is called the first time only as a mechanism to ensure that
"hot_standby_feedback = off" has taken effect. At the end of this
test, wait_for_xmins() again is called only to ensure that
hot_standby_feedback = on has taken effect.
Preferably I want wait_for_xmins() to get rid of the $node parameter,
because we can deduce it using slot name. But that requires having
get_node_from_slotname(). Your suggestion was to remove
get_node_from_slotname() and add back the $node param so as to reduce
duplicate code. Instead, how about keeping wait_for_xmins() in
PostgresNode.pm? This way, we won't have duplication, and also we
can get rid of param $node. This is just my preference; if you are
quite inclined to not have get_node_from_slotname(), I will go with
your suggestion.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Fri, Sep 27, 2019 at 12:41 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
Preferably I want wait_for_xmins() to get rid of the $node parameter,
because we can deduce it using slot name. But that requires having
get_node_from_slotname(). Your suggestion was to remove
get_node_from_slotname() and add back the $node param so as to reduce
duplicate code. Instead, how about keeping wait_for_xmins() in the
PostgresNode.pm() ? This way, we won't have duplication, and also we
can get rid of param $node. This is just my preference; if you are
quite inclined to not have get_node_from_slotname(), I will go with
your suggestion.
I'd be inclined not to have it. I think having a lookup function to
go from slot name -> node is strange; it doesn't really simplify
things that much for the caller, and it makes the logic harder to
follow. It would break outright if you had the same slot name on
multiple nodes, which is a perfectly reasonable scenario.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, 27 Sep 2019 at 23:21, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Sep 27, 2019 at 12:41 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
Preferably I want wait_for_xmins() to get rid of the $node parameter,
because we can deduce it using slot name. But that requires having
get_node_from_slotname(). Your suggestion was to remove
get_node_from_slotname() and add back the $node param so as to reduce
duplicate code. Instead, how about keeping wait_for_xmins() in the
PostgresNode.pm() ? This way, we won't have duplication, and also we
can get rid of param $node. This is just my preference; if you are
quite inclined to not have get_node_from_slotname(), I will go with
your suggestion.
I'd be inclined not to have it. I think having a lookup function to
go from slot name -> node is strange; it doesn't really simplify
things that much for the caller, and it makes the logic harder to
follow. It would break outright if you had the same slot name on
multiple nodes, which is a perfectly reasonable scenario.
Alright. Attached is the updated patch that splits the file into two
files, one that does only xmin related testing, and the other test
file that tests conflict recovery scenarios, and also one scenario
where drop-database drops the slots on the database on standby.
Removed get_slot_xmins() and get_node_from_slotname().
Renamed 'replica' to 'standby'.
Used node->backup() function instead of pg_basebackup command.
Renamed $master_slot to $master_slotname, similarly for $standby_slot.
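Regarding the node->backup() change above, the usual PostgresNode pattern looks roughly like this sketch (placeholder names, not the exact test code):
# Take a base backup through PostgresNode instead of invoking pg_basebackup
# directly, and initialize the standby from it with streaming enabled.
my $backup_name = 'b1';
$node_master->backup($backup_name);

my $node_standby = get_new_node('standby');
$node_standby->init_from_backup($node_master, $backup_name,
	has_streaming => 1);
$node_standby->append_conf('postgresql.conf', 'hot_standby_feedback = on');
$node_standby->start;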
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
logicaldecodng_standby_v3.tar.gz (application/x-gzip)